Publications by Salvatore Di Girolamo
2017
sPIN: High-performance streaming Processing in the Network. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC17), Denver, CO, USA, November 2017
Optimizing communication performance is imperative for large-scale computing because communication overheads limit the strong scalability of parallel applications. Today's network cards contain rather powerful processors optimized for data movement. However, these devices are limited to fixed functions, such as remote direct memory access. We develop sPIN, a portable programming model to offload simple packet processing functions to the network card. To demonstrate the potential of the model, we design a cycle-accurate simulation environment by combining the network simulator LogGOPSim and the CPU simulator gem5. We implement offloaded message matching, datatype processing, and collective communications and demonstrate transparent full-application speedups. Furthermore, we show how sPIN can be used to accelerate redundant in-memory filesystems and several other use cases. Our work investigates a portable packet-processing network acceleration model similar to compute acceleration with CUDA or OpenCL. We show how such network acceleration enables an ecosystem that can significantly speed up applications and system services.
@inproceedings{hoefler2017spin, abstract = {Optimizing communication performance is imperative for large-scale computing because communication overheads limit the strong scalability of parallel applications. Today{\textquoteright}s network cards contain rather powerful processors optimized for data movement. However, these devices are limited to fixed functions, such as remote direct memory access. We develop sPIN, a portable programming model to offload simple packet processing functions to the network card. To demonstrate the potential of the model, we design a cycle-accurate simulation environment by combining the network simulator LogGOPSim and the CPU simulator gem5. We implement offloaded message matching, datatype processing, and collective communications and demonstrate transparent full-application speedups. Furthermore, we show how sPIN can be used to accelerate redundant in-memory filesystems and several other use cases. Our work investigates a portable packet-processing network acceleration model similar to compute acceleration with CUDA or OpenCL. We show how such network acceleration enables an ecosystem that can significantly speed up applications and system services.}, author = {Torsten Hoefler and Salvatore Di Girolamo and Konstantin Taranov and Ron Brightwell}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, title = {sPIN: High-performance streaming Processing in the Network}, venue = {Denver, CO, USA}, year = {2017} }
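The central idea in sPIN is that short, user-written handler functions execute on the NIC for every arriving packet of a matched message. The following self-contained C sketch models that idea on the host; the pkt_t and handler_state_t types and the driver loop are hypothetical illustrations of the handler style, not the paper's actual API:

```c
/* spin_sketch.c -- toy host-side model of a sPIN-style per-packet
 * handler. All types and the "NIC" driver loop below are hypothetical
 * illustrations, not the real sPIN interface. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical packet view handed to a handler by the NIC runtime. */
typedef struct {
    const uint8_t *payload;   /* packet payload bytes */
    size_t         len;       /* payload length */
} pkt_t;

/* Hypothetical per-message state kept in NIC memory across packets. */
typedef struct {
    uint64_t bytes_seen;      /* running byte count for the message */
    uint64_t checksum;        /* toy reduction computed on the path */
} handler_state_t;

/* Payload handler: invoked once per packet; kept short and
 * data-movement oriented, as the model prescribes. */
static int payload_handler(const pkt_t *pkt, handler_state_t *st)
{
    st->bytes_seen += pkt->len;
    for (size_t i = 0; i < pkt->len; i++)
        st->checksum += pkt->payload[i];   /* on-path per-packet work */
    return 0;                              /* 0 = deliver the packet */
}

/* Host-side stand-in for the NIC invoking the handler per packet. */
int main(void)
{
    handler_state_t st = {0, 0};
    const char *msgs[] = {"hello", "spin", "nic"};
    for (int i = 0; i < 3; i++) {
        pkt_t p = { (const uint8_t *)msgs[i], strlen(msgs[i]) };
        payload_handler(&p, &st);
    }
    printf("bytes=%llu checksum=%llu\n",
           (unsigned long long)st.bytes_seen,
           (unsigned long long)st.checksum);
    return 0;
}
```

In the real model, handlers must stay small enough for the NIC to sustain line rate; the per-packet reduction above merely stands in for such a function, in the same way a CUDA kernel stands in for offloaded compute.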
Transparent Caching for RMA Systems. Proceedings of the 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS'17), Orlando, FL, USA, May 2017
The constantly increasing gap between communication and computation performance emphasizes the importance of communication-avoidance techniques. Caching is a well-known concept used to reduce accesses to slow local memories. In this work, we extend the caching idea to MPI-3 Remote Memory Access (RMA) operations. Here, caching can avoid inter-node communications and achieve similar benefits for irregular applications as communication-avoiding algorithms for structured applications. We propose CLaMPI, a caching library layered on top of MPI-3 RMA, to automatically optimize code with minimum user intervention. We demonstrate how cached RMA improves the performance of a Barnes-Hut simulation and a local clustering coefficient computation by up to 1.8x and 5x, respectively. Due to the low overheads in the cache miss case and the potential benefits, we expect that our ideas around transparent RMA caching will soon be an integral part of many MPI libraries.
@inproceedings{digirolamo2017clampi, abstract = {The constantly increasing gap between communication and computation performance emphasizes the importance of communication-avoidance techniques. Caching is a well-known concept used to reduce accesses to slow local memories. In this work, we extend the caching idea to MPI-3 Remote Memory Access (RMA) operations. Here, caching can avoid inter-node communications and achieve similar benefits for irregular applications as communication-avoiding algorithms for structured applications. We propose CLaMPI, a caching library layered on top of MPI-3 RMA, to automatically optimize code with minimum user intervention. We demonstrate how cached RMA improves the performance of a Barnes-Hut simulation and a local clustering coefficient computation by up to 1.8x and 5x, respectively. Due to the low overheads in the cache miss case and the potential benefits, we expect that our ideas around transparent RMA caching will soon be an integral part of many MPI libraries.}, author = {Salvatore Di Girolamo and Flavio Vella and Torsten Hoefler}, booktitle = {Proceedings of the 31st IEEE International Parallel \& Distributed Processing Symposium (IPDPS{\textquoteright}17)}, title = {Transparent Caching for RMA Systems}, venue = {Orlando, FL, USA}, year = {2017} }
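The mechanism CLaMPI describes is intercepting RMA reads and serving repeated accesses from a local cache, so a remote get touches the network only on a miss. Below is a minimal, self-contained sketch of that idea using standard MPI-3 RMA calls (MPI_Get plus MPI_Win_flush); the direct-mapped cache, the cached_get wrapper, and the slot layout are hypothetical simplifications, not CLaMPI's implementation:

```c
/* cached_get.c -- minimal sketch of transparent caching over MPI-3 RMA.
 * The toy direct-mapped cache assumes read-mostly data, nbytes <= LINE,
 * and repeated accesses with identical (rank, disp). */
#include <mpi.h>
#include <stdint.h>
#include <string.h>

#define SLOTS 64
#define LINE  256

typedef struct {
    int      valid;
    int      rank;            /* target rank of the cached region */
    MPI_Aint disp;            /* target displacement of the region */
    uint8_t  data[LINE];      /* locally cached copy */
} slot_t;

static slot_t cache[SLOTS];

/* Serve the read locally on a hit; on a miss, fetch with MPI_Get,
 * complete it with MPI_Win_flush, and fill the slot. */
static void cached_get(void *buf, int nbytes, int rank, MPI_Aint disp,
                       MPI_Win win)
{
    slot_t *s = &cache[(size_t)(disp / LINE + rank) % SLOTS];
    if (!(s->valid && s->rank == rank && s->disp == disp)) {
        MPI_Get(s->data, nbytes, MPI_BYTE, rank, disp, nbytes,
                MPI_BYTE, win);
        MPI_Win_flush(rank, win);          /* miss: one network round */
        s->valid = 1; s->rank = rank; s->disp = disp;
    }
    memcpy(buf, s->data, nbytes);          /* hit: purely local copy */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    uint8_t *base;
    MPI_Win win;
    MPI_Win_allocate(LINE, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
    memset(base, rank, LINE);           /* each rank exposes its id */
    MPI_Barrier(MPI_COMM_WORLD);        /* all windows initialized */

    MPI_Win_lock_all(0, win);
    uint8_t buf[8];
    cached_get(buf, 8, (rank + 1) % size, 0, win);  /* miss: network */
    cached_get(buf, 8, (rank + 1) % size, 0, win);  /* hit: local */
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

A real implementation additionally needs invalidation when remote window contents change and support for variable-sized regions; the point of the sketch is only that the second access completes without any network traffic.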
2016
Exploiting Offload Enabled Network Interfaces. IEEE Micro, vol. 36, pp. 6-17, July 2016
Network interface cards are among the key components for achieving efficient parallel performance. In the past, they have gained new functionalities, such as lossless transmission and remote direct memory access, that are now ubiquitous in high-performance systems. Prototypes of next-generation network cards now offer new features that facilitate device programming. In this article, the authors discuss an abstract machine model for offloading architectures. They used the Portals 4 network interface to implement the proposed abstraction model, and they present two microbenchmarks to show the effects of fully offloaded collective communications. They then propose the concept of persistent offloaded operations that can reduce the creation/offloading overhead, and they discuss a possible extension to the current Portals 4 interface to enable their support. The results show how this work can be used to accelerate existing MPI applications.
@article{digirolamo2016offload, abstract = {Network interface cards are among the key components for achieving efficient parallel performance. In the past, they have gained new functionalities, such as lossless transmission and remote direct memory access, that are now ubiquitous in high-performance systems. Prototypes of next-generation network cards now offer new features that facilitate device programming. In this article, the authors discuss an abstract machine model for offloading architectures. They used the Portals 4 network interface to implement the proposed abstraction model, and they present two microbenchmarks to show the effects of fully offloaded collective communications. They then propose the concept of persistent offloaded operations that can reduce the creation/offloading overhead, and they discuss a possible extension to the current Portals 4 interface to enable their support. The results show how this work can be used to accelerate existing MPI applications.}, author = {Salvatore Di Girolamo and Pierre Jolivet and Keith D. Underwood and Torsten Hoefler}, pages = {6--17}, journal = {IEEE Micro}, title = {Exploiting Offload Enabled Network Interfaces}, volume = {36}, year = {2016} }
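A key ingredient of offload-enabled NICs, and of Portals 4 in particular, is the triggered operation: an operation is armed together with a counter and a threshold, and the NIC fires it as soon as the counter reaches the threshold, so a whole chain of collective steps can run with no host involvement. The toy C model below simulates that firing rule on the host; counter_t, triggered_op_t, and the progress() loop are hypothetical stand-ins, not the Portals 4 API:

```c
/* triggered.c -- host-side toy model of NIC-offloaded dependency
 * chains in the spirit of Portals 4 triggered operations. All types
 * and the mini "NIC loop" are hypothetical illustrations. */
#include <stdio.h>

typedef struct { int count; } counter_t;

typedef struct {
    const char *name;      /* label for the trace output */
    counter_t  *trig_ct;   /* fires when trig_ct->count >= threshold */
    int         threshold;
    counter_t  *inc_ct;    /* counter bumped when this op completes */
    int         fired;
} triggered_op_t;

/* One pass of the "NIC": fire every armed op whose trigger is met. */
static int progress(triggered_op_t *ops, int n)
{
    int fired = 0;
    for (int i = 0; i < n; i++) {
        if (!ops[i].fired && ops[i].trig_ct->count >= ops[i].threshold) {
            printf("fired: %s\n", ops[i].name);
            ops[i].fired = 1;
            if (ops[i].inc_ct) ops[i].inc_ct->count++;
            fired++;
        }
    }
    return fired;
}

int main(void)
{
    counter_t recv = {0}, stage1 = {0}, stage2 = {0};

    /* A two-step chain: once both children's messages have arrived
     * (recv == 2), forward to the parent; once that completes, the
     * notification to the root fires, all without host code. */
    triggered_op_t ops[] = {
        { "put to parent", &recv,   2, &stage1, 0 },
        { "notify root",   &stage1, 1, &stage2, 0 },
    };

    recv.count = 2;                  /* two arrivals bump the counter */
    while (progress(ops, 2) > 0)     /* drain the dependency chain */
        ;
    printf("chain done: stage2=%d\n", stage2.count);
    return 0;
}
```

The persistent offloaded operations proposed in the article address the cost of re-arming such chains: rather than building and offloading the schedule before every collective, a persistent schedule is installed once and re-triggered, amortizing the creation/offloading overhead across invocations.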