# Publications by Edgar Solomonik

### 2017

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, CO, USA, November 2017

Betweenness centrality (BC) is a crucial graph problem that measures the significance of a vertex by the number of shortest paths leading through it. We propose Maximal Frontier Betweenness Centrality (MFBC): a succinct BC algorithm based on novel sparse matrix multiplication routines that performs a factor of p1/3 less communication on p processors than the best known alternatives, for graphs with n vertices and average degree k = n/p2/3. We formulate, implement, and prove the correctness of MFBC for weighted graphs by leveraging monoids instead of semirings, which enables a surprisingly succinct formulation. MFBC scales well for both extremely sparse and relatively dense graphs. It automatically searches a space of distributed data decompositions and sparse matrix multiplication algorithms for the most advantageous configuration. The MFBC implementation outperforms the well-known CombBLAS library by up to 8x and shows more robust performance. Our design methodology is readily extensible to other graph problems.

@inproceedings{abc, abstract = {Betweenness centrality (BC) is a crucial graph problem that measures the significance of a vertex by the number of shortest paths leading through it. We propose Maximal Frontier Betweenness Centrality (MFBC): a succinct BC algorithm based on novel sparse matrix multiplication routines that performs a factor of p1/3 less communication on p processors than the best known alternatives, for graphs with n vertices and average degree k = n/p2/3. We formulate, implement, and prove the correctness of MFBC for weighted graphs by leveraging monoids instead of semirings, which enables a surprisingly succinct formulation. MFBC scales well for both extremely sparse and relatively dense graphs. It automatically searches a space of distributed data decompositions and sparse matrix multiplication algorithms for the most advantageous configuration. The MFBC implementation outperforms the well-known CombBLAS library by up to 8x and shows more robust performance. Our design methodology is readily extensible to other graph problems.}, author = {Edgar Solomonik and Maciej Besta and Flavio Vella and Torsten Hoefler}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, title = {Scaling Betweenness Centrality using Communication-Efficient Sparse Matrix Multiplication}, venue = {Denver, CO, USA}, year = {2017} }

Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, Washington, DC, USA, June 2017

We reduce the cost of communication and synchronization in graph processing by analyzing the fastest way to process graphs: pushing the updates to a shared state or pulling the updates to a private state. We investigate the applicability of this push-pull dichotomy to various algorithms and its impact on complexity, performance, and the amount of used locks, atomics, and reads/writes. We consider 11 graph algorithms, 3 programming models, 2 graph abstractions, and various families of graphs. The conducted analysis illustrates surprising differences between push and pull variants of different algorithms in performance, speed of convergence, and code complexity; the insights are backed up by performance data from hardware counters. We use these findings to illustrate which variant is faster for each algorithm and to develop generic strategies that enable even higher speedups. Our insights can be used to accelerate graph processing engines or libraries on both massively-parallel shared-memory machines as well as distributed-memory systems.

@inproceedings{abc, abstract = {We reduce the cost of communication and synchronization in graph processing by analyzing the fastest way to process graphs: pushing the updates to a shared state or pulling the updates to a private state. We investigate the applicability of this push-pull dichotomy to various algorithms and its impact on complexity, performance, and the amount of used locks, atomics, and reads/writes. We consider 11 graph algorithms, 3 programming models, 2 graph abstractions, and various families of graphs. The conducted analysis illustrates surprising differences between push and pull variants of different algorithms in performance, speed of convergence, and code complexity; the insights are backed up by performance data from hardware counters. We use these findings to illustrate which variant is faster for each algorithm and to develop generic strategies that enable even higher speedups. Our insights can be used to accelerate graph processing engines or libraries on both massively-parallel shared-memory machines as well as distributed-memory systems.}, author = {Maciej Besta and Michal Podstawski and Linus Groner and Edgar Solomonik and Torsten Hoefler}, booktitle = {Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing}, title = {To Push or To Pull: On Reducing Communication and Synchronization in Graph Computations}, venue = {Washington, DC, USA}, year = {2017} }

Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, Washington, DC, USA, June 2017

Many large-scale scientific computations require eigenvalue solvers in a scaling regime where efficiency is limited by data movement. We introduce a parallel algorithm for computing the eigenvalues of a dense symmetric matrix, which performs asymptotically less communication than previously known approaches. We provide analysis in the Bulk Synchronous Parallel (BSP) model with additional consideration for communication between a local memory and cache. Given sufficient memory to store c copies of the symmetric matrix, our algorithm requires \Theta(\sqrt{c}) less interprocessor communication than previously known algorithms, for any c\leq p^{1/3} when using p processors. The algorithm first reduces the dense symmetric matrix to a banded matrix with the same eigenvalues. Subsequently, the algorithm employs successive reduction to O(\log p) thinner banded matrices. We employ two new parallel algorithms that achieve lower communication costs for the full-to-band and band-to-band reductions. Both of these algorithms leverage a novel QR factorization algorithm for rectangular matrices.

@inproceedings{abc, abstract = {Many large-scale scientific computations require eigenvalue solvers in a scaling regime where efficiency is limited by data movement. We introduce a parallel algorithm for computing the eigenvalues of a dense symmetric matrix, which performs asymptotically less communication than previously known approaches. We provide analysis in the Bulk Synchronous Parallel (BSP) model with additional consideration for communication between a local memory and cache. Given sufficient memory to store c copies of the symmetric matrix, our algorithm requires \Theta(\sqrt{c}) less interprocessor communication than previously known algorithms, for any c\leq p^{1/3} when using p processors. The algorithm first reduces the dense symmetric matrix to a banded matrix with the same eigenvalues. Subsequently, the algorithm employs successive reduction to O(\log p) thinner banded matrices. We employ two new parallel algorithms that achieve lower communication costs for the full-to-band and band-to-band reductions. Both of these algorithms leverage a novel QR factorization algorithm for rectangular matrices.}, author = {Edgar Solomonik and Grey Ballard and James Demmel and Torsten Hoefler}, booktitle = {Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures}, title = {A Communication-Avoiding Parallel Algorithm for the Symmetric Eigenvalue Problem}, venue = {Washington, DC, USA}, year = {2017} }

Proceedings of the 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS'17), Orlando, FL, USA, June 2017

Vectorization and GPUs will profoundly change graph processing. Traditional graph algorithms tuned for 32- or 64-bit based memory accesses will be inefficient on architectures with 512-bit wide (or larger) instruction units that are already present in the Intel Knights Landing (KNL) manycore CPU. Anticipating this shift, we propose SlimSell: a vectorizable graph representation to accelerate Breadth-First Search (BFS) based on sparse-matrix dense-vector (SpMV) products. SlimSell extends and combines the state-of-the-art SIMD-friendly Sell-C-σ matrix storage format with tropical, real, boolean, and sel-max semiring operations. The resulting design reduces the necessary storage (by up to 50%) and thus pressure on the memory subsystem. We augment SlimSell with the SlimWork and SlimChunk schemes that reduce the amount of work and improve load balance, further accelerating BFS. We evaluate all the schemes on Intel Haswell multicore CPUs, the state-of-the-art Intel Xeon Phi KNL manycore CPUs, and NVIDIA Tesla GPUs. Our experiments indicate which semiring offers highest speedups for BFS and illustrate that SlimSell accelerates a tuned Graph500 BFS code by up to 33%. This work shows that vectorization can secure high-performance in BFS based on SpMV products; the proposed principles and designs can be extended to other graph algorithms.

@inproceedings{abc, abstract = {Vectorization and GPUs will profoundly change graph processing. Traditional graph algorithms tuned for 32- or 64-bit based memory accesses will be inefficient on architectures with 512-bit wide (or larger) instruction units that are already present in the Intel Knights Landing (KNL) manycore CPU. Anticipating this shift, we propose SlimSell: a vectorizable graph representation to accelerate Breadth-First Search (BFS) based on sparse-matrix dense-vector (SpMV) products. SlimSell extends and combines the state-of-the-art SIMD-friendly Sell-C-σ matrix storage format with tropical, real, boolean, and sel-max semiring operations. The resulting design reduces the necessary storage (by up to 50\%) and thus pressure on the memory subsystem. We augment SlimSell with the SlimWork and SlimChunk schemes that reduce the amount of work and improve load balance, further accelerating BFS. We evaluate all the schemes on Intel Haswell multicore CPUs, the state-of-the-art Intel Xeon Phi KNL manycore CPUs, and NVIDIA Tesla GPUs. Our experiments indicate which semiring offers highest speedups for BFS and illustrate that SlimSell accelerates a tuned Graph500 BFS code by up to 33\%. This work shows that vectorization can secure high-performance in BFS based on SpMV products; the proposed principles and designs can be extended to other graph algorithms.}, author = {Maciej Besta and Florian Marending and Edgar Solomonik and }, booktitle = {Proceedings of the 31st IEEE International Parallel \& Distributed Processing Symposium (IPDPS{\textquoteright}17)}, title = {SlimSell: A Vectorized Graph Representation for Breadth-First Search}, venue = {Orlando, FL, USA}, year = {2017} }

Proceedings of the 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS'17), Orlando, FL, USA, May 2017

We present a new parallel algorithm for solving triangular systems with multiple right hand sides (TRSM). TRSM is used extensively in numerical linear algebra computations, both to solve triangular linear systems of equations as well as to compute factorizations with triangular matrices, such as Cholesky, LU, and QR. Our algorithm achieves better theoretical scalability than known alternatives, while maintaining numerical stability, via selective use of triangular matrix inversion. We leverage the fact that triangular inversion and matrix multiplication are more parallelizable than the standard TRSM algorithm. By only inverting triangular blocks along the diagonal of the initial matrix, we generalize the usual way of TRSM computation and the full matrix inversion approach. This flexibility leads to an efficient algorithm for any ratio of the number of right hand sides to the triangular matrix dimension. We provide a detailed communication cost analysis for our algorithm as well as for the recursive triangular matrix inversion. This cost analysis makes it possible to determine optimal block sizes and processor grids a priori. Relative to the best known algorithms for TRSM, our approach can require asymptotically fewer messages, while performing optimal amounts of computation and communication in terms of words sent.

@inproceedings{abc, abstract = {We present a new parallel algorithm for solving triangular systems with multiple right hand sides (TRSM). TRSM is used extensively in numerical linear algebra computations, both to solve triangular linear systems of equations as well as to compute factorizations with triangular matrices, such as Cholesky, LU, and QR. Our algorithm achieves better theoretical scalability than known alternatives, while maintaining numerical stability, via selective use of triangular matrix inversion. We leverage the fact that triangular inversion and matrix multiplication are more parallelizable than the standard TRSM algorithm. By only inverting triangular blocks along the diagonal of the initial matrix, we generalize the usual way of TRSM computation and the full matrix inversion approach. This flexibility leads to an efficient algorithm for any ratio of the number of right hand sides to the triangular matrix dimension. We provide a detailed communication cost analysis for our algorithm as well as for the recursive triangular matrix inversion. This cost analysis makes it possible to determine optimal block sizes and processor grids a priori. Relative to the best known algorithms for TRSM, our approach can require asymptotically fewer messages, while performing optimal amounts of computation and communication in terms of words sent.}, author = {Tobias Wicky and Edgar Solomonik and Torsten Hoefler}, booktitle = {Proceedings of the 31st IEEE International Parallel \& Distributed Processing Symposium (IPDPS{\textquoteright}17)}, title = {Communication-Avoiding Parallel Algorithms for Solving Triangular Systems of Linear Equations}, venue = {Orlando, FL, USA}, year = {2017} }