Switching Fabrics with Small Internal Buffers (ICS-FORTH)

Buffered switching fabrics (bibliography) are an efficient method for building packet switches with a very large number of ports. To keep the cost down, buffers inside the fabric should be small, with large buffers being necessary only at the input ports. We research flow control, congestion management, and scheduling methods to make such buffered switching fabrics operate as efficiently as possible. Our focus is on the Benes network, which is the lowest-cost topology that is known to yield operation free of internal blocking.

1. Switching Fabrics with Small Internal Buffers

Nikolaos I. Chrysos: "Request-Grant Scheduling for Congestion Elimination in Multistage Networks", Doctoral Dissertation, Univ. of Crete (advisor: M. Katevenis), December 2006; Technical Report FORTH-ICS/TR-388, Inst. of Computer Science, FORTH, Heraklio, Crete, Greece, April 2007; 185+42 = 227 pages.
Available in PDF (1.6 MBytes); © Copyright 2007 FORTH.

ABSTRACT: This thesis considers buffered multistage interconnection networks (fabrics), and investigates methods to reduce their buffer size requirements. Our contribution is a novel flow and congestion control scheme that achieves performance close to that of per-flow queueing while requiring much less buffer space than what per-flow queues would need. The new scheme utilizes a request-grant pre-approval phase, as many contemporary bufferless networks do, but its operation is much simpler and its performance is remarkably better. Traditionally, the role of requests in bufferless networks is to reserve an available time slot on each link along a packet's route, where these time slots are contiguous in time along the path, so as to guarantee non-conflicting packet transmission. These requirements impose a very heavy toll on the scheduling unit of such bufferless fabrics. By contrast, our requests do not reserve links for a specific time duration, but instead only reserve space in the buffers at their entry points; effectively, the scheduling decisions that concern different links are decoupled among themselves, leading to a much simpler admission process. The proposed scheduling subsystem comprises independent single-resource schedulers, operating in a pipeline; they operate asynchronously to each other. In this thesis we show that the reservation of buffers in front of critical network links --links that are unable to carry the potential aggregate demand-- eliminates congestion, in the sense that traffic flows seamlessly through the network: it neither gets dropped, nor is excessively blocked waiting for downstream buffers to become available.

First, we apply request-grant scheduling to a single-stage switch, with small, shared output queues, which serves as a model for the more challenging multistage case. We demonstrate that, in principle, a very small number of fabric buffers suffices to reach high performance levels: with 12-cell buffer space per output, performance is better than in buffered crossbars, which consume N cells of buffer space per output, where N is the number of ports. In this single-stage setting, we study the impact of input contention on scheduler performance, and the related synchronization phenomena. During this work, we have introduced a novel scheduling scheme for buffered crossbar switches that makes buffer size independent of the round-trip-time between the linecards and the switch.

We then proceed to the multistage case. Our main motivation and our primary benchmark is an example next-generation fabric challenge: a 1024x1024, 3-stage, non-blocking Clos/Benes fabric, running with no internal speedup, made of 96 single-chip 32x32 buffered crosssbar switching elements (3 stages of 32 switch chips each). To eliminate congestion in the fabric, we carefully apply our request-grant scheduling protocol. We demonstrate that it is feasible to implement all schedulers centrally, in a single chip. Besides congestion elimination, our scheduler can guarantee 100 percent in-order delivery, using very small reorder buffers, which can easily fit in on-chip memory. Simulation results indicate very good delay performance, and throughput that exceeds 95% under unbalanced traffic. Most prominent is the result that, under hot-spot traffic, with almost all output ports being congested, the non-congested outputs experience negligible delay degradation. The proposed system can directly operate on variable-size packets, eliminating the padding overhead and the associated internal speed-up. We also discuss a possible distributed version of the scheduling subsystem. Our scheme is appropriate to deal with heavy congestion; in systems that need to provide very low latency under (uncongested) light traffic, one would apply this scheme when the load exceeds a given threshold.

Lastly, we consider some blocking network topologies, like the banyan. In a banyan network, besides output ports, internal links can cause congestion as well. We show a fully distributed scheduler for this network, that eliminates congestion from both internal and output-port links

N. Chrysos, M. Katevenis: "Scheduling in Switches with Small Internal Buffers", Proc. IEEE Globecom 2005 Conference, St. Louis, MO, USA, 28 Nov. - 2 Dec. 2005, 6 pages, CDROM paper ID gc21_3;
- Preprint of September 2005 in PDF (300 KBytes); 6 pages - © Copyright 2005 by IEEE.

ABSTRACT:

and

N²

two

[Previous, outdated version: July 2005, 7 pages, in pdf (250 KB)].

Extended Version:
N. Chrysos, M. Katevenis: "Scheduling in Switches with Small Internal Buffers: Extended Version", FORTH-ICS, Crete, Greece, September 2005
- Available in PDF (420 KBytes); 12 pages - © Copyright 2005 by FORTH.
This extended version describes the results of the above Globecom 2005 paper at greater depth. It also discusses the behavior of the switch under bursty traffic (unbalanced transients), and demonstrates the performance gains of using grant rate control (grant throttling). The results presented in this paper further substantiate that buffers as small as 12 cells per output suffice for excellent performance.

N. Chrysos, M. Katevenis: "Scheduling in Non-Blocking Buffered Three-Stage Switching Fabrics", Proc. IEEE Infocom 2006 Conference, Barcelona, Spain, 23-29 Apr. 2006, 13 pages, CDROM paper ID 13_01.
- Preprint of January 2006 in PDF (460 KBytes) - © Copyright 2006 by IEEE.

ABSTRACT:

(i)

(ii)

(iii)

(iv)

(v)

(vi)

(vii)

(viii)

[Previous, outdated version: August 2005, 15 pages, in pdf (460 KB)].

N. Chrysos, M. Katevenis: "Preventing Buffer-Credit Accumulations in Switches with Shared Small Output Queues", Proc. IEEE Workshop on High Performance Switching and Routing (HPSR 2006), Poznan, Poland, 7-9 June 2006, pp. 409-416, ISBN 0-7803-9570-0.
- Preprint of April 2006 in PDF (430 KBytes); 8 pages - © Copyright 2006 by IEEE.

ABSTRACT:

2. Benes Fabrics with Internal Backpressure

G. Sapountzis, M. Katevenis: "Benes Switching Fabrics with O(N)-Complexity Internal Backpressure", IEEE Communications Magazine (special theme on "Next Generation Switching and Routing"), vol. 43, no. 1, January 2005, pp. 88-94.

G. Sapountzis, M. Katevenis: "Benes Switching Fabrics with O(N)-Complexity Internal Backpressure", Proceedings of the IEEE Workshop on High Performance Switching and Routing (HPSR 2003), Torino, Italy, June 2003, pp. 11-16.
(Similar to the Communications Magazine, Jan. 2005, paper)
- Available in PDF (180 KBytes) or Postscript (290 KBytes) or gzip'ed Postscript (85 KBytes) format; © Copyright 2003 IEEE.
- The slides of the presentation are also available, in PDF format (190 KBytes).

ABSTRACT: Multistage buffered switching fabrics are the most efficient method for scaling packet switches to very large numbers of ports. The Benes network is the lowest-cost switching fabric known to yield operation free of internal blocking. Backpressure inside a switching fabric can limit the use of expensive off-chip buffer memory to just virtual-output queues (VOQ) in front of the input stage. This paper extends the known backpressure architectures to the Benes network. To achieve this, we had to successfully combine per-flow backpressure, multipath routing (inverse multiplexing), and cell resequencing. We present a flow merging scheme that is needed to bring the cost of backpressure down to O(N) per switching element. We prove freedom from deadlock for a wide class of multipath cell distribution algorithms. Using a cell-time-accurate simulator, we verify operation free of internal blocking, we evaluate various cell distribution and resequencing methods, we compare performance to that of ideal output queueing and the iSLIP crossbar scheduling algorithm, and we show that the delay of well-behaved flows remains unaffected by the presence of congested traffic to oversubscribed output ports.

G. Sapountzis, M. Katevenis: "Benes Switching Fabrics with O(N)-Complexity Internal Backpressure", Inst. of Computer Science, FORTH, Heraklio, Crete, Greece, July 2003, 11 pages.
Available in PDF (150 KBytes) or Postscript (450 KBytes) or gzip'ed Postscript (125 KBytes) format; © Copyright 2003 FORTH.

This is a longer version of the above HPSR-2003 paper. In addition to the more detailed presentation, it also contains the proof of freedom from deadlock.

Georgios Sapountzis: "Benes Switching Fabrics with O(N)-Complexity Internal Backpressure", Technical Report FORTH-ICS/TR-316, Inst. of Computer Science, FORTH, Heraklio, Crete, Greece; M.Sc. Thesis, Univ. of Crete (advisor: M. Katevenis); December 2002, 12+64 = 76 pages.
Available in PDF (500 KBytes); © Copyright 2002 FORTH.

This is the full-length version of our research; it covers in more details the topics covered by the above papers.
In addition, this thesis uses simplified models of the Benes fabric to analytically show that cell distribution does not result in throughput limitations, and to identify the points of the fabric where contention is resolved. This latter point may be the key to a future design of appropriate distributed scheduling algorithms that provide weighted max-min fairness for the overall system.

Acknowledgements:
Part of this work was supported by an IBM PhD Fellowship to Nikos Chrysos, 2004-2006. Vasilios Siris, Dionisios Pnevmatikatos, and Paraskevi Fragopoulou helped us shape our ideas; we thank them.

Older (outdated) papers:

G. Sapountzis, M. Katevenis: "Benes Switching Fabrics with O(N)-Complexity Internal Backpressure", Inst. of Computer Science, FORTH, Heraklio, Crete, Greece, August 2002, 12 pages.
Available in PDF (260 KBytes) or Postscript (490 KBytes) or gzip'ed Postscript (135 KBytes) format; © Copyright 2002 FORTH.

[Superseded by the July 2003 version, above].

G. Sapountzis, M. Katevenis: "Benes Fabrics with Internal Backpressure: First Work-in-Progress Report", Technical Report FORTH-ICS/TR-303, Inst. of Computer Science, FORTH, Heraklio, Crete, Greece, March 2002, 27 pages.
Available in Postscript (1.1 MBytes) or gzip'ed Postscript (200 KBytes) format; © Copyright 2002 FORTH.

[Superseded by TR-316, above].

Manolis Katevenis: "Wide Links: 2logN-Stage Non-Blocking Networks with In-Order Packet Delivery", Inst. of Computer Science, FORTH, Heraklio, Crete, Greece, June 1995, unpublished internal working paper.

[This (unfinished) paper presented our August 1994 work on how to use multilane backpressure in multipath (Benes) fabrics and how to merge flows so as to keep the number of lanes manageable, while guaranteeing freedom from deadlock and maintaining non-blocking operation.]

© Copyright 2002-2007 by FORTH or the IEEE:
These papers are protected by copyright. Permission to make digital/hard copies of all or part of this material without fee is granted provided that the copies are made for personal use, they are not made or distributed for profit or commercial advantage, the FORTH or IEEE copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of the Foundation for Research & Technology -- Hellas (FORTH) or the IEEE. To copy otherwise, in whole or in part, to republish, to post on servers, or to redistribute to lists, requires prior specific written permission and/or a fee.

Up to Packet Switch Architecture R&D at CARV-ICS-FORTH

Last updated: Apr. 2007, by M. Katevenis.

Switching Fabrics with Small Internal Buffers for Scalable Non-Blocking Interconnects

1. Switching Fabrics with Small Internal Buffers

2. Benes Fabrics with Internal Backpressure

Switching Fabrics with Small Internal Buffers
for Scalable Non-Blocking Interconnects