Low-Latency Networks-on-Chip (NoC)
Computer Architecture and VLSI Systems (CARV) Laboratory,
Institute of Computer Science (ICS), FORTH,
Heraklion, Crete, Greece
G. Michelogiannakis, D. Pnevmatikatos, M. Katevenis:
"Approaching Ideal NoC Latency with Pre-Configured Routes",
Proc. 1st ACM/IEEE Int. Symposium on Networks-on-Chips,
Princeton, NJ, USA, 7-9 May 2007;
- Preprint of March 2007 in
PDF (130 KBytes) or
PS (400 KBytes);
10 pages - © Copyright 2007 by IEEE.
- Presentation Slides in
PDF (400 KBytes) -
© Copyright 2007 by FORTH.
In multi-core ASICs, processors and other compute
engines need to communicate with memory blocks and
other cores with latency as close as possible to the
ideal of a direct buffered wire. However, current
state-of-the-art networks-on-chip (NoCs) incur, at
best, a latency of one clock cycle per hop.
We investigate the design of a NoC that offers close
to the ideal latency in some preferred, run-time
configurable paths. Processors and other compute
engines may perform network reconfiguration to
guarantee low latency over different sets of paths as
needed. Flits on non-preferred paths are given lower
priority than flits on preferred ones and, when there
is no contention, incur a delay of one clock cycle per hop.
To achieve our goal, we use the "mad-postman"
technique: every incoming flit is eagerly (i.e.
speculatively) forwarded to the input's preferred
output, if any. This is accomplished with the mere
delay of a single pre-enabled tri-state driver. We
later check if that decision was correct, and if not,
we forward the flit to the proper output.
Incorrectly forwarded flits are classified as dead
and eliminated in later hops.
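The eager-forwarding decision described above can be illustrated with a small behavioral sketch. It is not the hardware design from the paper: the function names, the port labels, and the representation of a flit's required output as a plain string are all assumptions made for illustration, and contention is ignored.

```python
from typing import List, Optional, Tuple

# One emission: (cycle offset, output port, is_dead).
Emission = Tuple[int, str, bool]

def forward_flit(correct_output: str,
                 preferred_output: Optional[str]) -> List[Emission]:
    """Behavioral model of one router input (hypothetical, contention-free).

    correct_output:   port the routing function would select for this flit.
    preferred_output: pre-configured preferred port of this input, or None.
    """
    emissions: List[Emission] = []
    if preferred_output is not None:
        # Eager (speculative) forwarding: the flit leaves in cycle 0,
        # before the routing decision has been checked.
        speculation_ok = preferred_output == correct_output
        emissions.append((0, preferred_output, not speculation_ok))
        if speculation_ok:
            return emissions
    # No preferred output, or wrong speculation: send the flit to the
    # proper output after the usual one-cycle router delay.
    emissions.append((1, correct_output, False))
    return emissions

def surviving(emissions: List[Emission]) -> List[Emission]:
    """Model later hops dropping flits that were classified as dead."""
    return [e for e in emissions if not e[2]]
```

In this model, a flit whose required output matches the preferred one leaves in cycle 0, while a mis-speculated flit produces one dead copy on the preferred output and one live copy on the proper output a cycle later.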
We use a 2D mesh topology tailored for
processor-memory communication, and a modified version
of XY routing that remains deadlock-free.
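For reference, the textbook dimension-order XY routing that the modified scheme starts from can be sketched as follows. This is the standard baseline only, not the modified version used in the paper, whose details are not given here. The coordinate convention (x grows to the east, y grows to the north) and the port names are assumptions.

```python
from typing import List, Tuple

Coord = Tuple[int, int]

def xy_next_hop(cur: Coord, dst: Coord) -> str:
    """Plain XY routing: correct the X coordinate first, then Y."""
    (x, y), (dx, dy) = cur, dst
    if x != dx:
        return "E" if dx > x else "W"
    if y != dy:
        return "N" if dy > y else "S"
    return "LOCAL"  # the flit has arrived at its destination router

def xy_path(src: Coord, dst: Coord) -> List[str]:
    """List the output ports taken from src to dst on a 2D mesh."""
    step = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
    path: List[str] = []
    cur = src
    while (port := xy_next_hop(cur, dst)) != "LOCAL":
        path.append(port)
        cur = (cur[0] + step[port][0], cur[1] + step[port][1])
    return path
```

Because every packet finishes all of its X moves before making any Y move, plain XY routing on a mesh never turns from the Y dimension back into X, which is the usual argument for its deadlock freedom.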
Our evaluation shows that, for the preferred paths,
our approach offers a typical latency of around 500 ps,
versus 1500 ps for a full clock cycle and 135 ps for an
ideal direct connection, in a 130 nm technology;
non-preferred paths incur a delay of one clock cycle
per hop, similar to that of other approaches. The
performance gains are significant and may prove very
useful in other application domains as well.
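To put the reported figures side by side, the short calculation below scales them over a multi-hop path. It assumes that each figure applies per hop and that latency adds up linearly with hop count; both are simplifying assumptions for illustration, not results from the paper, and the 4-hop path is an arbitrary example.

```python
# Figures quoted in the abstract (130 nm technology), in picoseconds.
PREFERRED_PS = 500      # typical latency on a preferred path
CLOCK_CYCLE_PS = 1500   # one full clock cycle (non-preferred path)
IDEAL_PS = 135          # ideal direct connection

def path_latency_ps(hops: int, per_hop_ps: int) -> int:
    """Total latency under the assumed linear per-hop model."""
    return hops * per_hop_ps

hops = 4  # arbitrary example path length
for label, per_hop in [("preferred", PREFERRED_PS),
                       ("clocked", CLOCK_CYCLE_PS),
                       ("ideal", IDEAL_PS)]:
    print(f"{label:>9}: {path_latency_ps(hops, per_hop)} ps")

# Ratio of a full clock cycle to the preferred-path latency.
speedup_vs_clocked = CLOCK_CYCLE_PS / PREFERRED_PS
```

Under these assumptions, a preferred path is three times faster than a clocked hop-by-hop path, while still being roughly 3.7 times slower than the ideal direct connection.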