Fast Parallel Comparison Circuits for Scheduling
Kostas Harteros and
Manolis Katevenis
Computer Architecture and VLSI Systems (CARV) Lab,
Institute of Computer Science (ICS), FORTH
Science and Technology Park of Crete,
P.O.Box 1385, Heraklion, Crete, GR 711 10 Greece
© copyright 2002 by FORTH, Crete, Greece
Project and Result Outline:
In order to provide advanced Quality of Service (QoS) architectures
in high speed networks,
Weighted Fair Queueing (WFQ) is needed.
We have been working since 1985 in this area, i.e. on:
(i)
per-flow queueing, and
(ii)
weighted round-robin (WRR) scheduling.
WRR scheduling is generally implemented using priority queues:
for each flow, a priority value is maintained
--oftentimes corresponding to the next time-to-service value--
and the scheduler chooses the minimum among these values,
corresponding e.g. to the closest next time-to-service.
Quickly finding the smallest number
in a set of hundreds or thousands of priority values
will often require hardware support.
The form of such support depends on
how fast the set of eligible flows changes
(eligible flows are flows that have non-empty queues
and are not stopped due to flow control):
-
Few eligibility changes per cell time.
In this case,
implementations based on the Heap data structure
are often advantageous: see our work on
(Pipelined) Heap Management in Hardware.
An alternative is to use
per-class urgency counters,
when the weight factors are picked
from a relatively small ``menu'' of allowed factor values.
These solutions apply for example
to the ingress line cards of many switches,
where at most one flow per cell time may become eligible,
when the arriving cell is enqueued into an empty queue,
and at most one flow per cell time may become ineligible,
when the departing cell leaves its queue empty.
-
Potentially many eligibility changes per cell time.
This is the harder case,
where heaps cannot be taken advantage of,
because each eligibility change results in
one heap insertion or deletion operation.
We deal with this case in the present work.
This case may arise in ingress line cards
that receive "multilane" backpressure from the switch,
which selectively turns ON or OFF
many flows or classes of flows at once.
This case also arises in
buffered crossbar switches,
where multiple crosspoint buffers may change state at once.
When the set of eligible flows varies arbitrarily fast,
and we need to quickly find the minimum
among the priority values of the eligible flows,
"brute force" comparisons are needed between all priority values.
To avoid the N-square cost
of an all-to-all comparison architecture,
we use a binary tree of comparators.
We developed an innovative organization for the tree of comparators,
where signals are propagated in parallel:
- from the MS bit to the LS bit of each comparator in the tree; and
- from the leaf level to the root level of the tree.
In this way,
the delays of the individual comparators
and the delays of the tree levels
are placed in parallel, rather than in series.
We also studied the comparison operation extensively,
and we developed appropritate fast compare-and-mux circuits.
Each such circuit receives the bits of the input numbers
at different times,
starting from the MS bit and progressing to the LS bit,
and produces the bits of the smaller of the two numbers
at correspondingly different times;
the MS bits of the result are produced
before the LS bits of the inputs are looked at.
We find that a comparator organization
analogous to the carry-select adder
generaly yields the fastest comparison tree.
To achieve this,
we modified the traditional carry-select add/subtract organization,
which proceeds from LS to MS bit,
so that it operates in the reverse direction,
from MS to LS bit.
We designed comparison trees of various types and sizes,
as well as a WRR scheduler built around such trees,
in synthesisable Verilog hardware-description language.
We also described our designs in C code, for verification purposes.
We performed synthesis, placement, and routing
for a 0.18-micron CMOS process,
and we present delay, area, and power consumption figures
for various tree sizes and various widths of the comparators
(number of bits for each priority value).
Our results show that, for example,
the minimum of 256 twenty-four-bit integer numbers
can be found in 4.5 ns,
in the 0.18-micron semi-custom CMOS technology that we used.
In the same technology and for the same carry-select-style comparator,
the comparison of two 24-bit numbers takes 1.15 ns.
Given that the binary tree with 256 leaves has 8 levels of comaprators,
if individual comparators and tree levels
were not operating in parallel,
the find-minimum operation would need
8 times 1.15, i.e. 9.2 nanoseconds,
as compared to 4.5 ns using our architecture.
Papers:
-
Kostas G.I. Harteros:
"Fast Parallel Comparison Circuits for Scheduling",
Technical Report FORTH-ICS/TR-304,
Inst. of Computer Science, FORTH, Heraklio, Crete, Greece;
M.Sc. Thesis, Univ. of Crete (advisor: M. Katevenis);
March 2002, 78 pages.
Available in
Postscript
(1 MByte) or
gzip'ed Postscript
(650 KBytes) format;
© Copyright 2002 FORTH.
© Copyright 2002 by FORTH:
These papers are protected by copyright.
Permission to make digital/hard copies
of all or part of this material without fee
is granted provided that the copies are made for personal use,
they are not made or distributed for profit or commercial advantage,
the FORTH copyright notice,
the title of the publication and its date appear,
and notice is given that copying is by permission
of the Foundation for Research & Technology -- Hellas (FORTH).
To copy otherwise, in whole or in part,
to republish, to post on servers, or to redistribute to lists,
requires prior specific written permission and/or a fee.