Publications by Maciej Besta
2018
Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
Emerging chips with hundreds and thousands of cores require networks with unprecedented energy/area efficiency and scalability. To address this, we propose Slim NoC (SN): a new on-chip network design that delivers significant improvements in efficiency and scalability compared to the state-of-the-art. The key idea is to use two concepts from graph and number theory, degree-diameter graphs combined with non-prime finite fields, to enable the smallest number of ports for a given core count. SN is inspired by state-of-the-art off-chip topologies; it identifies and distills their advantages for NoC settings while solving several key issues that lead to significant overheads on-chip. SN provides NoC-specific layouts, which further enhance area/energy efficiency. We show how to augment SN with state-of-the-art router microarchitecture schemes such as Elastic Links, to make the network even more scalable and efficient. Our extensive experimental evaluations show that SN outperforms both traditional low-radix topologies (e.g., meshes and tori) and modern high-radix networks (e.g., various Flattened Butterflies) in area, latency, throughput, and static/dynamic power consumption for both synthetic and real workloads. SN provides a promising direction in scalable and energy-efficient NoC topologies.
@inproceedings{abc, abstract = {Emerging chips with hundreds and thousands of cores require networks with unprecedented energy/area efficiency and scalability. To address this, we propose Slim NoC (SN): a new on-chip network design that delivers significant improvements in efficiency and scalability compared to the state-of-the-art. The key idea is to use two concepts from graph and number theory, degree-diameter graphs combined with non-prime finite fields, to enable the smallest number of ports for a given core count. SN is inspired by state-of-the-art off-chip topologies; it identifies and distills their advantages for NoC settings while solving several key issues that lead to significant overheads on-chip. SN provides NoC-specific layouts, which further enhance area/energy efficiency. We show how to augment SN with state-of-the-art router microarchitecture schemes such as Elastic Links, to make the network even more scalable and efficient. Our extensive experimental evaluations show that SN outperforms both traditional low-radix topologies (e.g., meshes and tori) and modern high-radix networks (e.g., various Flattened Butterflies) in area, latency, throughput, and static/dynamic power consumption for both synthetic and real workloads. SN provides a promising direction in scalable and energy-efficient NoC topologies.}, author = {Maciej Besta and Syed M. Hassan and Sudhakar Yalamanchili and Rachata Ausavarungnirun and Onur Mutlu and Torsten Hoefler}, booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)}, title = {Slim NoC: A Low-Diameter On-Chip Network Topology for High Energy Efficiency and Scalability}, venue = {Williamsburg, VA, USA}, year = {2018} }
2017
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, CO, USA, November 2017
Betweenness centrality (BC) is a crucial graph problem that measures the significance of a vertex by the number of shortest paths leading through it. We propose Maximal Frontier Betweenness Centrality (MFBC): a succinct BC algorithm based on novel sparse matrix multiplication routines that performs a factor of p1/3 less communication on p processors than the best known alternatives, for graphs with n vertices and average degree k = n/p2/3. We formulate, implement, and prove the correctness of MFBC for weighted graphs by leveraging monoids instead of semirings, which enables a surprisingly succinct formulation. MFBC scales well for both extremely sparse and relatively dense graphs. It automatically searches a space of distributed data decompositions and sparse matrix multiplication algorithms for the most advantageous configuration. The MFBC implementation outperforms the well-known CombBLAS library by up to 8x and shows more robust performance. Our design methodology is readily extensible to other graph problems.
@inproceedings{abc, abstract = {Betweenness centrality (BC) is a crucial graph problem that measures the significance of a vertex by the number of shortest paths leading through it. We propose Maximal Frontier Betweenness Centrality (MFBC): a succinct BC algorithm based on novel sparse matrix multiplication routines that performs a factor of p1/3 less communication on p processors than the best known alternatives, for graphs with n vertices and average degree k = n/p2/3. We formulate, implement, and prove the correctness of MFBC for weighted graphs by leveraging monoids instead of semirings, which enables a surprisingly succinct formulation. MFBC scales well for both extremely sparse and relatively dense graphs. It automatically searches a space of distributed data decompositions and sparse matrix multiplication algorithms for the most advantageous configuration. The MFBC implementation outperforms the well-known CombBLAS library by up to 8x and shows more robust performance. Our design methodology is readily extensible to other graph problems.}, author = {Edgar Solomonik and Maciej Besta and Flavio Vella and Torsten Hoefler}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, title = {Scaling Betweenness Centrality using Communication-Efficient Sparse Matrix Multiplication}, venue = {Denver, CO, USA}, year = {2017} }
Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, Washington, DC, USA, June 2017
We reduce the cost of communication and synchronization in graph processing by analyzing the fastest way to process graphs: pushing the updates to a shared state or pulling the updates to a private state. We investigate the applicability of this push-pull dichotomy to various algorithms and its impact on complexity, performance, and the amount of used locks, atomics, and reads/writes. We consider 11 graph algorithms, 3 programming models, 2 graph abstractions, and various families of graphs. The conducted analysis illustrates surprising differences between push and pull variants of different algorithms in performance, speed of convergence, and code complexity; the insights are backed up by performance data from hardware counters. We use these findings to illustrate which variant is faster for each algorithm and to develop generic strategies that enable even higher speedups. Our insights can be used to accelerate graph processing engines or libraries on both massively-parallel shared-memory machines as well as distributed-memory systems.
@inproceedings{abc, abstract = {We reduce the cost of communication and synchronization in graph processing by analyzing the fastest way to process graphs: pushing the updates to a shared state or pulling the updates to a private state. We investigate the applicability of this push-pull dichotomy to various algorithms and its impact on complexity, performance, and the amount of used locks, atomics, and reads/writes. We consider 11 graph algorithms, 3 programming models, 2 graph abstractions, and various families of graphs. The conducted analysis illustrates surprising differences between push and pull variants of different algorithms in performance, speed of convergence, and code complexity; the insights are backed up by performance data from hardware counters. We use these findings to illustrate which variant is faster for each algorithm and to develop generic strategies that enable even higher speedups. Our insights can be used to accelerate graph processing engines or libraries on both massively-parallel shared-memory machines as well as distributed-memory systems.}, author = {Maciej Besta and Michal Podstawski and Linus Groner and Edgar Solomonik and Torsten Hoefler}, booktitle = {Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing}, title = {To Push or To Pull: On Reducing Communication and Synchronization in Graph Computations}, venue = {Washington, DC, USA}, year = {2017} }
Proceedings of the 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS'17), Orlando, FL, USA, June 2017
Vectorization and GPUs will profoundly change graph processing. Traditional graph algorithms tuned for 32- or 64-bit based memory accesses will be inefficient on architectures with 512-bit wide (or larger) instruction units that are already present in the Intel Knights Landing (KNL) manycore CPU. Anticipating this shift, we propose SlimSell: a vectorizable graph representation to accelerate Breadth-First Search (BFS) based on sparse-matrix dense-vector (SpMV) products. SlimSell extends and combines the state-of-the-art SIMD-friendly Sell-C-σ matrix storage format with tropical, real, boolean, and sel-max semiring operations. The resulting design reduces the necessary storage (by up to 50%) and thus pressure on the memory subsystem. We augment SlimSell with the SlimWork and SlimChunk schemes that reduce the amount of work and improve load balance, further accelerating BFS. We evaluate all the schemes on Intel Haswell multicore CPUs, the state-of-the-art Intel Xeon Phi KNL manycore CPUs, and NVIDIA Tesla GPUs. Our experiments indicate which semiring offers highest speedups for BFS and illustrate that SlimSell accelerates a tuned Graph500 BFS code by up to 33%. This work shows that vectorization can secure high-performance in BFS based on SpMV products; the proposed principles and designs can be extended to other graph algorithms.
@inproceedings{abc, abstract = {Vectorization and GPUs will profoundly change graph processing. Traditional graph algorithms tuned for 32- or 64-bit based memory accesses will be inefficient on architectures with 512-bit wide (or larger) instruction units that are already present in the Intel Knights Landing (KNL) manycore CPU. Anticipating this shift, we propose SlimSell: a vectorizable graph representation to accelerate Breadth-First Search (BFS) based on sparse-matrix dense-vector (SpMV) products. SlimSell extends and combines the state-of-the-art SIMD-friendly Sell-C-σ matrix storage format with tropical, real, boolean, and sel-max semiring operations. The resulting design reduces the necessary storage (by up to 50\%) and thus pressure on the memory subsystem. We augment SlimSell with the SlimWork and SlimChunk schemes that reduce the amount of work and improve load balance, further accelerating BFS. We evaluate all the schemes on Intel Haswell multicore CPUs, the state-of-the-art Intel Xeon Phi KNL manycore CPUs, and NVIDIA Tesla GPUs. Our experiments indicate which semiring offers highest speedups for BFS and illustrate that SlimSell accelerates a tuned Graph500 BFS code by up to 33\%. This work shows that vectorization can secure high-performance in BFS based on SpMV products; the proposed principles and designs can be extended to other graph algorithms.}, author = {Maciej Besta and Florian Marending and Edgar Solomonik and }, booktitle = {Proceedings of the 31st IEEE International Parallel \& Distributed Processing Symposium (IPDPS{\textquoteright}17)}, title = {SlimSell: A Vectorized Graph Representation for Breadth-First Search}, venue = {Orlando, FL, USA}, year = {2017} }
2016
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, Kyoto, Japan, June 2016
We propose a topology-aware distributed Reader-Writer lock that accelerates irregular workloads for supercomputers and data centers. The core idea behind the lock is a modular design that is an interplay of three distributed data structures: a counter of readers/writers in the critical section, a set of queues for ordering writers waiting for the lock, and a tree that binds all the queues and synchronizes writers with readers. Each structure is associated with a parameter for favoring either readers or writers, enabling adjustable performance that can be viewed as a point in a three dimensional parameter space. We also develop a distributed topology-aware MCS lock that is a building block of the above design and improves state-of-the-art MPI implementations. Both schemes use non-blocking Remote Memory Access (RMA) techniques for highest performance and scalability. We evaluate our schemes on a Cray XC30 and illustrate that they outperform state-of-the-art MPI-3 RMA locking protocols by 81% and 73%, respectively. Finally, we use them to accelerate a distributed hashtable that represents irregular workloads such as key-value stores or graph processing.
@inproceedings{abc, abstract = {We propose a topology-aware distributed Reader-Writer lock that accelerates irregular workloads for supercomputers and data centers. The core idea behind the lock is a modular design that is an interplay of three distributed data structures: a counter of readers/writers in the critical section, a set of queues for ordering writers waiting for the lock, and a tree that binds all the queues and synchronizes writers with readers. Each structure is associated with a parameter for favoring either readers or writers, enabling adjustable performance that can be viewed as a point in a three dimensional parameter space. We also develop a distributed topology-aware MCS lock that is a building block of the above design and improves state-of-the-art MPI implementations. Both schemes use non-blocking Remote Memory Access (RMA) techniques for highest performance and scalability. We evaluate our schemes on a Cray XC30 and illustrate that they outperform state-of-the-art MPI-3 RMA locking protocols by 81\% and 73\%, respectively. Finally, we use them to accelerate a distributed hashtable that represents irregular workloads such as key-value stores or graph processing.}, author = {Patrick Schmid and Maciej Besta and Torsten Hoefler}, booktitle = {Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing}, title = {High-Performance Distributed RMA Locks}, venue = {Kyoto, Japan}, year = {2016} }
2015
Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015, Portland, OR, USA, June 2015
@inproceedings{abc, author = {Maciej Besta and Torsten Hoefler}, booktitle = {Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015, Portland, OR, USA}, title = {Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages.}, url = {http://doi.acm.org/10.1145/2749246.2749263}, year = {2015} }
Proceedings of the 29th ACM on International Conference on Supercomputing, ICS'15, Newport Beach/Irvine, CA, USA, June 2015
@inproceedings{abc, author = {Maciej Besta and Torsten Hoefler}, booktitle = {Proceedings of the 29th ACM on International Conference on Supercomputing, ICS{\textquoteright}15, Newport Beach/Irvine, CA, USA}, title = {Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations.}, url = {http://doi.acm.org/10.1145/2751205.2751219}, year = {2015} }