Publications by Keith D. Underwood
2017
Proceedings of the 25th Annual Symposium on High-Performance Interconnects (HOTI'17), Santa Clara, CA, USA, August 2017
The advent of non-volatile memory (NVM) technologies has added an interesting nuance to the node level memory hierarchy. With modern 100 Gb/s networks, the NVM tier of storage can often be slower than the high performance network in the system; thus, a new challenge arises in the datacenter. Whereas prior efforts have studied the impacts of multiple sources targeting one node (i.e., incast) and have studied multiple flows causing congestion in inter-switch links, it is now possible for a single flow from a single source to overwhelm the bandwidth of a key portion of the memory hierarchy. This can subsequently spread to the switches and lead to congestion trees in a flow-controlled network or excessive packet drops without flow control. In this work we describe protocols which avoid overwhelming the receiver in the case of a source/sink rate mismatch. We design our protocols on top of Portals 4, which enables us to make use of network offload. Our protocol yields up to 4× higher throughput in a 5k node Dragonfly topology for a permutation traffic pattern in which only 1% of all nodes have a memory write-bandwidth limitation of 1/8th of the network bandwidth.
@inproceedings{abc, abstract = {The advent of non-volatile memory (NVM) technologies has added an interesting nuance to the node level memory hierarchy. With modern 100 Gb/s networks, the NVM tier of storage can often be slower than the high performance network in the system; thus, a new challenge arises in the datacenter. Whereas prior efforts have studied the impacts of multiple sources targeting one node (i.e., incast) and have studied multiple flows causing congestion in inter-switch links, it is now possible for a single flow from a single source to overwhelm the bandwidth of a key portion of the memory hierarchy. This can subsequently spread to the switches and lead to congestion trees in a flow-controlled network or excessive packet drops without flow control. In this work we describe protocols which avoid overwhelming the receiver in the case of a source/sink rate mismatch. We design our protocols on top of Portals 4, which enables us to make use of network offload. Our protocol yields up to 4{\texttimes} higher throughput in a 5k node Dragonfly topology for a permutation traffic pattern in which only 1\% of all nodes have a memory write-bandwidth limitation of 1/8th of the network bandwidth.}, author = {Timo Schneider and James Dinan and Mario Flajslik and Keith D. Underwood and Torsten Hoefler}, booktitle = {Proceedings of the 25th Annual Symposium on High-Performance Interconnects (HOTI{\textquoteright}17)}, title = {Fast Networks and Slow Memories: A Mechanism for Mitigating Bandwidth Mismatches}, venue = {Santa Clara, CA, USA}, year = {2017} }
2016
IEEE Micro, July 2016
Network interface cards are one of the key components to achieve efficient parallel performance. In the past, they have gained new functionalities, such as lossless transmission and remote direct memory access, that are now ubiquitous in high-performance systems. Prototypes of next-generation network cards now offer new features that facilitate device programming. In this article, the authors discuss an abstract machine model for offloading architectures. They used the Portals 4 network interface to implement the proposed abstraction model, and they present two microbenchmarks to show the effects of fully offloaded collective communications. They then propose the concept of persistent offloaded operations that can reduce the creation/offloading overhead, and they discuss a possible extension to the current Portals 4 interface to enable their support. The results obtained show how this work can be used to accelerate existing MPI applications.
@article{abc, abstract = {Network interface cards are one of the key components to achieve efficient parallel performance. In the past, they have gained new functionalities, such as lossless transmission and remote direct memory access, that are now ubiquitous in high-performance systems. Prototypes of next-generation network cards now offer new features that facilitate device programming. In this article, the authors discuss an abstract machine model for offloading architectures. They used the Portals 4 network interface to implement the proposed abstraction model, and they present two microbenchmarks to show the effects of fully offloaded collective communications. They then propose the concept of persistent offloaded operations that can reduce the creation/offloading overhead, and they discuss a possible extension to the current Portals 4 interface to enable their support. The results obtained show how this work can be used to accelerate existing MPI applications.}, author = {Salvatore Di Girolamo and Pierre Jolivet and Keith D. Underwood and Torsten Hoefler}, pages = {6--17}, journal = {IEEE Micro}, title = {Exploiting Offload Enabled Network Interfaces}, volume = {36}, year = {2016} }
2015
TOPC, July 2015
@article{abc, author = {Torsten Hoefler and James Dinan and Rajeev Thakur and Brian W. Barrett and Pavan Balaji and William Gropp and Keith D. Underwood}, journal = {TOPC}, title = {Remote Memory Access Programming in {MPI-3}}, url = {http://doi.acm.org/10.1145/2780584}, year = {2015} }