In order to provide advanced Quality of Service (QoS) architectures in high-speed networks, it is commonly agreed nowadays that one needs (i) per-flow queueing and (ii) weighted round-robin (WRR) scheduling; the combination of the two is frequently referred to as Weighted Fair Queueing (WFQ). This Laboratory has been active in research and development work on both of these technologies since 1985. For our per-flow queueing work, follow that link; our work on (weighted) round-robin scheduling is described below.
Our work started in 1985 with an investigation of several techniques for performing round-robin scheduling at high speed among a large number of flows. Later, in 1990, we designed a WRR scheduler for hundreds of flows, using a content-addressable memory (CAM) in CMOS VLSI. More recently, since 1996, we have developed techniques for WRR scheduling among many thousands of flows, either at relatively low cost or at very high speed; these are described below.
We simulated the performance of this scheduling algorithm and found that the average service-time jitter never exceeded 2% of the service time. The heart of this scheduler (the urgency counter and decision logic) fits in less than one third of an Altera 10K40 FPGA; with a 50 MHz clock, it can service 4 million events per second, which corresponds to an aggregate I/O line throughput of 1.6 Gbps in the case of scheduling ATM cells.
For more details, follow this link.
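The page does not detail the scheduler's datapath, so the following C sketch is only a rough behavioral illustration of WRR scheduling with per-flow urgency counters: every slot, each flow (assumed always backlogged) accumulates urgency in proportion to its weight, and the decision logic serves the most urgent flow, charging it the sum of all weights. The flow count and weights below are made-up example values, not those of the FPGA design.

/* Rough behavioral model of WRR scheduling with per-flow urgency counters.
 * This is a reconstruction for explanation only; the actual FPGA design's
 * counter update and decision logic may differ. */
#include <stdio.h>

#define NFLOWS 4                                /* hypothetical number of flows */

static const int weight[NFLOWS] = {5, 3, 1, 1}; /* made-up example weights      */
static long urgency[NFLOWS];                    /* per-flow urgency counters    */

/* One slot: every (always-backlogged) flow earns urgency equal to its weight,
 * then the most urgent flow is served and charged the sum of all weights. */
static int schedule_one_slot(void)
{
    int i, winner = 0, wsum = 0;
    for (i = 0; i < NFLOWS; i++) {
        wsum += weight[i];
        urgency[i] += weight[i];
        if (urgency[i] > urgency[winner])
            winner = i;
    }
    urgency[winner] -= wsum;                    /* charge the served flow */
    return winner;
}

int main(void)
{
    for (int slot = 0; slot < 20; slot++)
        printf("slot %2d -> flow %d\n", slot, schedule_one_slot());
    return 0;
}

In this model, each flow receives a share of the slots proportional to its weight over any sufficiently long window, and the urgency counters stay bounded, which is what keeps the service-time jitter small.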
We looked at methods to manage heap data structures at high speed, using specialized hardware. First, we analyzed implementations involving one FPGA and external SRAM, and found that a few tens of clock cycles per heap operation are needed for priority queues containing several thousand elements. Next, we proposed a novel algorithm for top-to-bottom heap insertions, thus enabling one to manage a heap in a pipelined fashion, in order to achieve very high speed. We designed such a pipelined heap manager in synthesizable Verilog form, as a core that can be integrated into ASICs. It can be configured to any heap size, and a range of cost-performance design points is possible. Performance is up to about one operation per clock cycle. For a 16K-entry example in 0.18-micron CMOS technology, the silicon area is approximately 15 mm² and the performance suffices for 10 to 40 Gbps line cards.
For more details, follow this link.
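To make the pipelining argument concrete, here is a minimal C sketch of a top-to-bottom insertion into an array-based binary min-heap: the path to the new slot is known in advance from the binary representation of the new heap size, so the operation visits exactly one tree level per step, from the root downward, which is what allows successive operations to be overlapped level by level. This is a generic software illustration under simplified assumptions (insert-only, error handling omitted), not the Verilog core itself.

/* Top-to-bottom insertion into an array-based binary min-heap (1-indexed:
 * heap[i]'s children are heap[2i] and heap[2i+1]).  The path from the root
 * to the new slot is given by the bits of the new heap size below its
 * leading 1, so the walk touches exactly one tree level per step, top to
 * bottom -- which is what lets a hardware manager pipeline operations. */

#define MAX_ENTRIES (1 << 14)          /* e.g. a 16K-entry heap */

static int heap[MAX_ENTRIES + 1];      /* heap[1..n] holds the keys */
static int n;                          /* current number of entries */

void heap_insert_topdown(int key)
{
    if (n >= MAX_ENTRIES)
        return;                        /* heap full; error handling omitted */

    int slot = ++n;                    /* index of the new, empty leaf slot */
    int depth = 0;                     /* depth of that slot below the root */
    while ((1 << (depth + 1)) <= slot)
        depth++;

    int node = 1;                      /* walk starts at the root */
    for (int level = depth - 1; level >= 0; level--) {
        if (key < heap[node]) {        /* keep the smaller key at this node */
            int tmp = heap[node];      /* and carry the larger one downward */
            heap[node] = key;
            key = tmp;
        }
        node = 2 * node + ((slot >> level) & 1);   /* next node on the path */
    }
    heap[node] = key;                  /* node == slot: place the last key  */
}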
We developed an innovative organization for the tree of comparators, where signals propagate in parallel along the bits of each comparator as well as across the levels of the tree; in this way, the delays of the individual comparators and the delays of the tree levels are placed in parallel rather than in series. To use this technique with the fastest comparator organization possible, we modified the traditional carry-select add/subtract concept, which proceeds from the least-significant (LS) to the most-significant (MS) bit, so that it applies to comparisons and operates in the reverse direction, from MS to LS bit. Using our technique, an 8-level tree of 24-bit comparators in 0.18-micron semi-custom CMOS technology finds the minimum of 256 numbers in just 4.5 ns, twice as fast as a traditional comparator tree and fast enough for line rates in excess of 40 Gbit/s.
For more details, follow this link.
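Behaviorally, the circuit is a balanced binary tree of two-input comparators that reduces 256 keys to their minimum in 8 levels; the speed-up described above comes from overlapping the bit-level and level-by-level delays, which is a circuit technique with no direct software analogue. The plain C model below shows only the tree reduction itself (a real scheduler would also carry along the identity of the winning entry).

/* Behavioral model of an 8-level tree of two-input comparators that finds
 * the minimum of 256 keys (e.g. 24-bit values held in 32-bit words).
 * Each pass models one tree level; the hardware evaluates the levels with
 * overlapped bit-level delays, which this model does not capture. */
#include <stdint.h>

#define NKEYS 256                      /* 2^8 inputs, 8 tree levels */

uint32_t min_of_256(const uint32_t key[NKEYS])
{
    uint32_t level[NKEYS];
    int width = NKEYS;

    for (int i = 0; i < NKEYS; i++)
        level[i] = key[i];

    while (width > 1) {                /* one iteration per tree level */
        for (int i = 0; i < width / 2; i++)
            level[i] = (level[2*i] < level[2*i + 1]) ? level[2*i]
                                                     : level[2*i + 1];
        width /= 2;
    }
    return level[0];                   /* the overall minimum */
}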
ABSTRACT: We present the architecture of a general-purpose B-ISDN switch chip and, in particular, its novel feature: the cell (packet) multiplexing algorithm and its implementation in hardware. The chip is intended to be connected, via its bit-parallel links, to other similar switch chips or to link-interface chips, providing maximum flexibility in building small or large networks. Our chip implements buffering, routing, flow control, cell scheduling, and cut-through, all in hardware. We present a detailed design of the chip's scheduling function: a weighted round-robin multiplexing scheme implemented using a counter, frequency weights stored in reverse bit order, and a content-addressable memory. This algorithm allocates the available bandwidth to the virtual circuits that can use it, in proportion to the prescribed weights. Other features are chip-to-chip flow control with a built-in window mechanism, a pool of shared buffers, and a set of dedicated buffers, together offering powerful buffer management capabilities. These features are key to successfully switching traffic with different service requirements, such as real-time traffic and non-interactive data traffic; the mechanisms built into the chip can be used by the network manager to offer guaranteed service performance to the real-time traffic, and to fully utilize the spare capacity of the links by serving the lower-priority traffic without allowing congestion (fairly allocating buffers and bandwidth). We are currently designing this chip in a full-custom CMOS technology; its crucial parts have been laid out and simulated, thus demonstrating its feasibility.
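The abstract names the ingredients of the multiplexing scheme (a counter, weights stored in reverse bit order, a content-addressable memory) but not the full datapath, which is in the paper. As a rough software illustration of why bit-reversal spreads a circuit's service slots evenly, the sketch below marks a VC with weight w (out of a frame of 2^K cell slots) as eligible exactly in the w slots whose bit-reversed index is less than w; the CAM search and the readiness of other VCs are not modeled.

/* Illustration of bandwidth spreading via bit-reversed slot indices.
 * In a frame of 2^K cell slots, a VC with weight w is marked eligible in
 * slot t iff bitrev_K(t) < w; that yields exactly w eligible slots per
 * frame, spread nearly evenly.  Reconstruction for intuition only, not
 * the chip's actual counter/CAM datapath. */
#include <stdio.h>

#define K 4                            /* frame of 2^K = 16 slots (example) */

static unsigned bitrev(unsigned t, int bits)
{
    unsigned r = 0;
    for (int i = 0; i < bits; i++)
        r = (r << 1) | ((t >> i) & 1);
    return r;
}

int main(void)
{
    const unsigned weight = 5;         /* hypothetical weight, out of 16 */
    for (unsigned t = 0; t < (1u << K); t++)
        if (bitrev(t, K) < weight)
            printf("slot %2u: eligible\n", t);
    return 0;
}

For example, with K = 4 and a weight of 5, the eligible slots come out as 0, 2, 4, 8, and 12: five slots spread across the 16-slot frame rather than bunched together.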
ABSTRACT: A new switching architecture is proposed, based on the tradeoffs of modern VLSI technology: inexpensive memory and two-dimensional layout structures. Today, it is economically feasible to preallocate buffer space individually to each virtual circuit in every node, so that "congestion" ceases to have negative effects. On the contrary, when some low-priority circuits offer more traffic than the network can carry, full utilization of the link bandwidth is achieved. In this context, the allocation of bandwidth can be done automatically and in a "fair" way, if packets are multiplexed by circularly scanning all virtual circuits and transmitting one packet from each "ready" circuit. This multiplexing algorithm distributes all the available link bandwidth equally among all the VCs that can use it (distributions other than equal are also possible), while it also guarantees an upper bound on the total packet delay through non-congested VCs (VCs that use less than their share of the bandwidth). We present methods for hardware implementation of such fast circular scans, and propose a structure for the switching nodes of such networks, consisting of a crossbar arrangement, like a systolic array, that performs merge-sorting. It is ideally suited for physical layout on printed-circuit boards or with wafer-scale integration.
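The multiplexing rule stated above (circularly scan all virtual circuits and transmit one packet from each ready one) can be sketched directly in software; the version below keeps a pointer to the last-served VC and, each packet time, advances it to the next VC with a queued packet. The VC count and queue representation are placeholders; the paper's contribution is the fast hardware realization of this scan via a systolic, merge-sorting crossbar, which a sequential loop does not capture.

/* Round-robin multiplexing by circular scanning of virtual circuits:
 * each packet time, advance a pointer from the last-served VC and pick
 * the next VC that has a packet queued.  Sequential reference model only;
 * the paper implements the scan with a systolic, merge-sorting crossbar. */

#define NVC 64                         /* hypothetical number of VCs */

int queue_len[NVC];                    /* packets queued per VC (stub)   */
static int last_served = NVC - 1;      /* pointer into the circular scan */

/* Returns the VC to serve next, or -1 if every VC queue is empty. */
int next_vc_to_serve(void)
{
    for (int step = 1; step <= NVC; step++) {
        int vc = (last_served + step) % NVC;
        if (queue_len[vc] > 0) {
            last_served = vc;
            queue_len[vc]--;           /* one packet transmitted */
            return vc;
        }
    }
    return -1;                         /* link idle: no ready VC */
}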
© Copyright 1998-2002 by FORTH, Crete, Greece. Last updated: March 2002 (minor: Sep. 2005), by M. Katevenis.