Memory and Communication in Multicore Systems

64 Formic boards in a 4x4x4 (3D) mesh interconnection, connected to 2 ARM systems and 1 XUP board
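
As a rough illustration of the 4x4x4 topology named above, the following minimal sketch enumerates the 64 boards and their mesh neighbors. The (x, y, z) coordinate scheme and the absence of wrap-around links are assumptions made for this example, not details taken from the Formic design.

```python
from itertools import product

DIM = 4  # boards per axis, from the 4x4x4 description

# Unit steps along each of the three mesh axes.
STEPS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def neighbors(x, y, z, dim=DIM):
    """Return the mesh neighbors of board (x, y, z).

    Assumes a plain mesh (no wrap-around links); the coordinate
    scheme is hypothetical.
    """
    result = []
    for dx, dy, dz in STEPS:
        nx, ny, nz = x + dx, y + dy, z + dz
        if 0 <= nx < dim and 0 <= ny < dim and 0 <= nz < dim:
            result.append((nx, ny, nz))
    return result


boards = list(product(range(DIM), repeat=3))
# Each board-to-board link is seen from both of its endpoints.
link_count = sum(len(neighbors(*b)) for b in boards) // 2
```

Under these assumptions a corner board has 3 neighbors, an interior board has 6, and the mesh contains 144 board-to-board links in total.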

1. Most Recent Work Overview

S. Lyberis, G. Kalokerinos, M. Lygerakis, V. Papaefstathiou, D. Tsaliagkos, M. Katevenis, D. Pnevmatikatos and D. Nikolopoulos: "Formic: Cost-efficient and Scalable Prototyping of Manycore Architectures", Proc. of the IEEE 20th Int. Symposium on Field-Programmable Custom Computing Machines (FCCM'12), Toronto Canada, May 2012, pp. 61-64; DOI: 10.1109/FCCM.2012.20
- Preprint in PDF (2.1 MBytes); © Copyright 2012 by IEEE.

For more information about our Formic board and its use for prototyping large multicore systems, please visit: http://formic-board.com

M. Katevenis, V. Papaefstathiou, S. Kavadias, D. Pnevmatikatos, F. Silla, and D. S. Nikolopoulos: "Explicit Communication and Synchronization in SARC", To appear in IEEE Micro Magazine (IEEE Micro), Special Issue: European Multicore Processing Projects, September/October 2010.
- Preprint in PDF (640 KBytes); © Copyright 2010 by IEEE.

G. Kalokairinos, V. Papaefstathiou, G. Nikiforos, S. Kavadias, M. Katevenis, D. Pnevmatikatos, and X. Yang: "Prototyping a Configurable Cache/Scratchpad Memory with Virtualized User-Level RDMA Capability", To appear in Transactions on High-Performance Embedded Architectures and Compilers (Transactions on HiPEAC), Special Issue: SAMOS2009 Best Papers, Springer Verlag LNCS 2010.
- Preprint in PDF (370 KBytes); © Copyright 2010 by Springer.

ABSTRACT: We present the hardware design and implementation of a local memory system for individual processors inside future chip multiprocessors (CMP). Our memory system supports both implicit communication via caches, and explicit communication via directly accessible local ("scratchpad") memories and remote DMA (RDMA). We provide run-time configurability of the SRAM blocks that lie near each processor, so that portions of them operate as 2nd level (local) cache, while the rest operate as scratchpad. We also strive to merge the communication subsystems required by the cache and scratchpad into one integrated Network Interface (NI) and Cache Controller (CC), in order to economize on circuits. The processor interacts with the NI at user-level through virtualized command areas in scratchpad; the NI uses a similar access mechanism to provide efficient support for two hardware synchronization primitives: counters, and queues. We describe the NI design, the hardware cost, and the latencies of our FPGA-based prototype implementation that integrates four MicroBlaze processors, each with 64 KBytes of local SRAM, a crossbar NoC, and a DRAM controller. One-way, end-to-end, user-level communication completes within about 20 clock cycles for short transfer sizes.
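
The run-time split of local SRAM between cache and scratchpad described in this abstract can be pictured with a minimal software model. The class name, field names, and the 1-KByte granularity are assumptions made for this sketch; the abstract only states that part of each 64-KByte SRAM operates as cache and the rest as scratchpad.

```python
from dataclasses import dataclass

SRAM_KBYTES = 64  # local SRAM per processor in the described prototype


@dataclass
class LocalMemoryConfig:
    """Hypothetical model of one processor's local SRAM partition."""

    cache_kbytes: int  # portion configured as 2nd-level (local) cache

    def __post_init__(self):
        if not 0 <= self.cache_kbytes <= SRAM_KBYTES:
            raise ValueError(
                f"cache size must be between 0 and {SRAM_KBYTES} KBytes"
            )

    @property
    def scratchpad_kbytes(self):
        # Whatever is not used as cache operates as scratchpad.
        return SRAM_KBYTES - self.cache_kbytes


config = LocalMemoryConfig(cache_kbytes=16)
```

With 16 KBytes configured as cache in this model, the remaining 48 KBytes are available as directly accessible scratchpad.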

The prototype includes multiple Xilinx XUPV5 processor boards, containing 4 MicroBlaze cores per board, interconnected via a Xilinx ML325 switch board that contains 3 parallel crossbars, using 3 RocketIO (2.5 Gbps) links per board.

2. Support for Explicit Communication and Synchronization

S. Kavadias, M. Katevenis, M. Zampetakis, and D. S. Nikolopoulos: "On-chip Communication and Synchronization Mechanisms with Cache-Integrated Network Interfaces", Proc. 7th ACM International Conference on Computing Frontiers (CF-2010), Bertinoro, Italy, 17-19 May 2010, pp. 217-226, http://doi.acm.org/10.1145/1787275.1787328 ISBN: 978-1-4503-0044-5 (ranked, by the PC Co-Chairs, as one of the top three papers of the Conference)
- Preprint in PDF (390 KBytes); © Copyright 2010 by ACM.

M. Katevenis: "Replicate and Migrate Objects in the Runtime, not Cache Lines or Pages in Hardware", Invited Talk at the Barcelona Multicore Workshop 2010 (BMW 2010), Barcelona, Spain, 21-22 Oct. 2010.
- Slides available in PDF (1.5 MBytes); © Copyright 2010 by FORTH.

M. Katevenis: "Towards Unified Mechanisms for Inter-Processor Communication", Keynote Presentation at the IEEE Int. Conf. on Embedded Computer Systems: Architectures, Modeling and Simulation (IC-SAMOS2008), Samos, Greece, 21-24 July 2008.
- Slides available in PDF (130 KBytes); © Copyright 2008 by FORTH.

M. Katevenis: "Interprocessor Communication seen as Load-Store Instruction Generalization", in The Future of Computing, essays in memory of Stamatis Vassiliadis, K. Bertels e.a. Editors, Delft, The Netherlands, 28 Sep. 2007, pp. 55-68.
- Available in PDF (3.7 MBytes) - Slides in PDF (40 KBytes); © Copyright 2007 by FORTH.

C. Villavieja, M. Katevenis, N. Navarro, D. Pnevmatikatos, A. Ramirez, S. Kavadias, V. Papaefstathiou, and D. S. Nikolopoulos: "Hardware Support for Explicit Communication in Scalable CMP's", Technical Report UPC-DAC-RR-CAP-2009-1, UPC, BSC, and FORTH-ICS, January 2009.
- Available in PDF (420 KBytes); © Copyright 2009 by UPC, BSC, and FORTH.

3. Hardware Prototypes for Interprocessor Communication Mechanisms

3.1 Tightly-coupled Network Interfaces (2008-2010)

G. Kalokairinos, V. Papaefstathiou, G. Nikiforos, S. Kavadias, M. Katevenis, D. Pnevmatikatos, and X. Yang: "FPGA Implementation of a Configurable Cache/Scratchpad Memory with Virtualized User-Level RDMA Capability", Proc. IEEE International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (IC-SAMOS2009), Samos, Greece, 20-23 July 2009, ISBN 978-1-4244-4501-1, pp. 149-156.
- Preprint in PDF (440 KBytes) © Copyright 2009 by IEEE; Slides in PDF (550 KBytes) © Copyright 2009 by FORTH.

This conference paper is extended and superseded by the Transactions on HiPEAC journal paper.

G. Nikiforos, G. Kalokairinos, V. Papaefstathiou, S. Kavadias, D. Pnevmatikatos, and M. Katevenis: "A run-time Configurable Cache/Scratchpad Memory with Virtualized User-Level RDMA Capability", In the 6th HiPEAC Industrial Workshop on Embedded Computing, THALES Research and Development - Palaiseau, Paris, France, 26 November 2008.
- Available in PDF (270 KBytes) - Slides in PDF (210 KBytes) © Copyright 2008 by FORTH.

This paper is superseded by the Transactions on HiPEAC journal paper.

3.2 Loosely-coupled Network Interfaces (2006-2007)

V. Papaefstathiou, D. Pnevmatikatos, M. Marazakis, G. Kalokairinos, A. Ioannou, M. Papamichael, S. Kavadias, G. Mihelogiannakis, and M. Katevenis: "Prototyping Efficient Interprocessor Communication Mechanisms", Proc. IEEE International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (IC-SAMOS2007), Samos, Greece, 16-19 July 2007.
- Preprint in PDF (130 KBytes); © Copyright 2007 by IEEE.

ABSTRACT: Parallel computing systems are becoming widespread and grow in sophistication. Besides simulation, rapid system prototyping becomes important in designing and evaluating their architecture. We present an efficient FPGA-based platform that we developed and use for research and experimentation on high speed interprocessor communication, network interfaces and interconnects. Our platform supports advanced communication capabilities such as Remote DMA, Remote Queues, zero-copy data delivery and flexible notification mechanisms, as well as link bundling for increased performance. We report on the platform architecture, its design cost, complexity and performance (latency and throughput). We also report our experiences from implementing benchmarking kernels and a user-level benchmark application, and show how software can take advantage of the provided features, but also expose the weaknesses of the system.

DiniGroup Xilinx VII-Pro FPGA-based Prototype

The prototype includes eight x86 nodes, each with a 10Gbps PCI-X RDMA-capable NIC (DiniGroup Virtex-II Pro boards), interconnected via four Xilinx ML325 switch boards (variable-size buffered crossbars), using four RocketIO (2.5 Gbps) links per node.
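
The link bundling in this prototype comes down to simple arithmetic: four 2.5 Gbps links per node match the 10 Gbps rate of the NIC. The sketch below spells this out; the function name is hypothetical, and the figures are raw line rates that ignore any encoding or protocol overhead.

```python
LINK_GBPS = 2.5  # raw rate of one RocketIO link, as stated in the text


def bundled_rate_gbps(links, per_link_gbps=LINK_GBPS):
    """Aggregate raw line rate of a bundle of parallel links.

    Ignores encoding and protocol overhead (an assumption of this sketch).
    """
    return links * per_link_gbps


four_link_bundle = bundled_rate_gbps(4)   # eight-node x86 prototype
three_link_bundle = bundled_rate_gbps(3)  # XUPV5-based prototype
```

The four-link bundle gives 10 Gbps per node, while the three-link configuration of the XUPV5-based prototype gives 7.5 Gbps per board.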

V. Papaefstathiou, G. Kalokairinos, A. Ioannou, M. Papamichael, G. Mihelogiannakis, S. Kavadias, E. Vlahos, D. Pnevmatikatos, and M. Katevenis: "An FPGA-based Prototyping Platform for Research in High-Speed Interprocessor Communication", In the 2nd HiPEAC Industrial Workshop on Embedded Computing, Philips (NXP), Eindhoven, Netherlands, 17 October 2006.
- Available in PDF (200 KBytes) - Slides in PDF (1 MByte) © Copyright 2006 by FORTH.

This paper is superseded by the IEEE IC-SAMOS 2007 conference paper.

4. Other Papers, Posters, and Related Work

C. Kachris, G. Nikiforos, V. Papaefstathiou, S. Kavadias, and M. Katevenis: "Low-latency Explicit Communication and Synchronization in Scalable Multi-core Clusters", Short paper and poster presented at the IEEE International Conference on Cluster Computing (CLUSTER2010), Hersonissos, Crete, Greece, 20-24 September 2010.
- Preprint in PDF (430 KBytes) © Copyright 2010 by IEEE.
- Poster in PDF (660 KBytes) © Copyright 2010 by FORTH.

M. Katevenis, V. Papaefstathiou, S. Kavadias, G. Nikiforos, D. Pnevmatikatos, D. Nikolopoulos, and C. Kachris: "Explicit Communication and Synchronization in SARC", Poster presented at the HiPEAC Innovation Event, Edinburgh, UK, 3-5 May 2010 (ranked 3rd out of 19 in the poster competition).
- Available in PDF (290 KBytes); © Copyright 2010 by FORTH.

M. Marazakis, V. Papaefstathiou, and A. Bilas: "Optimization and Bottleneck Analysis of Network Block I/O in Commodity Storage Systems", In Proc. 21st ACM International Conference on Supercomputing (ICS2007), Seattle, Washington, USA, 16-20 June 2007.
- Preprint in PDF (210 KBytes); © Copyright 2007 by ACM.

M. Marazakis, V. Papaefstathiou, G. Kalokairinos, and A. Bilas: "Experiences from Debugging a PCI-X-based RDMA-capable NIC", In the 3rd Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies (RAIT2006) - In conjunction with IEEE International Conference on Cluster Computing (CLUSTER2006), Barcelona, Spain, 25-28 September 2006.
- Preprint in PDF (120 KBytes); © Copyright 2006 by IEEE.

M. Marazakis, K. Xinidis, V. Papaefstathiou, and A. Bilas: "Efficient Remote Block-level I/O over an RDMA-capable NIC", In Proc. 20th ACM International Conference on Supercomputing (ICS2006), Queensland, Australia, 28 June - 1 July 2006.
- Preprint in PDF (110 KBytes); © Copyright 2006 by ACM.

5. Past Work on IPC: The Telegraphos Project (1993-97)

Telegraphos, from the Greek words "tele" (remote) and "grapho" (write), was a project on low-latency, high-throughput interprocessor communication. During that project, in 1993-1997, at the FORTH-ICS CARV Laboratory, workstation clustering prototypes were designed and built, including processor-network interfaces for remote-write based, protected, user-level communication.

Projects - Funding - Acknowledgements

This work is currently (2010-2012) being conducted mostly within the ENCORE (#248647) project on "ENabling technologies for a programmable many-CORE", and in cooperation with the TEXT (#261580) project, both funded by the European Union FP7 Programme. In the period 2006-2009, this work was conducted mostly within the SARC European integrated project on "Scalable computer ARChitecture", funded by the European Union FP6 Programme (#027648). Financial support, especially for hardware prototyping, was also provided by the FP6 Marie-Curie project UNiSIX (MC #509595). Our work in general, and the ENCORE and SARC projects in particular, are within the framework of the HiPEAC Network of Excellence.

Angelos Bilas, Alex Ramirez, and Georgi Gaydadjiev helped us shape our ideas; we deeply thank them. We also thank, for their participation and assistance: M. Ligerakis, M. Marazakis, M. Papamichael, E. Vlahos, G. Mihelogiannakis, and A. Ioannou.

We also deeply thank the Xilinx University Program for donating to us a number of FPGA chips, boards, and licences for the Xilinx EDA tools.

© Copyright 2006-2012 by IEEE or ACM or Springer or FORTH:
These papers are protected by copyright. Permission to make digital/hard copies of all or part of this material without fee is granted provided that the copies are made for personal use, they are not made or distributed for profit or commercial advantage, the IEEE or ACM or Springer or FORTH copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of the IEEE or of the ACM or of the Springer or of the Foundation for Research & Technology - Hellas (FORTH), as appropriate. To copy otherwise, in whole or in part, to republish, to post on servers, or to redistribute to lists, requires prior specific written permission and/or a fee.