Publications by Nandita Vijaykumar


2017

14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017, Boston, MA, USA, March 2017
@inproceedings{hsieh2017gaia,
	author = {Kevin Hsieh and Aaron Harlap and Nandita Vijaykumar and Dimitris Konomis and Gregory R. Ganger and Phillip B. Gibbons and Onur Mutlu},
	booktitle = {14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017, Boston, MA, USA},
	title = {Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds.},
	url = {https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/hsieh},
	year = {2017}
}
2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA, February 2017
@inproceedings{hassan2017softmc,
	author = {Hasan Hassan and Nandita Vijaykumar and Samira Manabi Khan and Saugata Ghose and Kevin K. Chang and Gennady Pekhimenko and Donghyuk Lee and Oguz Ergin and Onur Mutlu},
	booktitle = {2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA},
	title = {SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies.},
	url = {https://doi.org/10.1109/HPCA.2017.62},
	year = {2017}
}

2016

34th IEEE International Conference on Computer Design, ICCD 2016, Scottsdale, AZ, USA, October 2016
@inproceedings{hsieh2016pointer,
	author = {Kevin Hsieh and Samira Manabi Khan and Nandita Vijaykumar and Kevin K. Chang and Amirali Boroumand and Saugata Ghose and Onur Mutlu},
	booktitle = {34th IEEE International Conference on Computer Design, ICCD 2016, Scottsdale, AZ, USA},
	title = {Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation.},
	url = {http://dx.doi.org/10.1109/ICCD.2016.7753257},
	year = {2016}
}
49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan, October 2016
@inproceedings{vijaykumar2016zorua,
	author = {Nandita Vijaykumar and Kevin Hsieh and Gennady Pekhimenko and Samira Manabi Khan and Ashish Shrestha and Saugata Ghose and Adwait Jog and Phillip B. Gibbons and Onur Mutlu},
	booktitle = {49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan},
	title = {Zorua: A holistic approach to resource virtualization in GPUs.},
	url = {http://dx.doi.org/10.1109/MICRO.2016.7783718},
	year = {2016}
}
43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 2016
@inproceedings{hsieh2016tom,
	author = {Kevin Hsieh and Eiman Ebrahimi and Gwangsun Kim and Niladrish Chatterjee and Mike O{\textquoteright}Connor and Nandita Vijaykumar and Onur Mutlu and Stephen W. Keckler},
	booktitle = {43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea},
	title = {Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems.},
	url = {http://doi.ieeecomputersociety.org/10.1109/ISCA.2016.27},
	year = {2016}
}
2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 2016
@inproceedings{pekhimenko2016toggle,
	author = {Gennady Pekhimenko and Evgeny Bolotin and Nandita Vijaykumar and Onur Mutlu and Todd C. Mowry and Stephen W. Keckler},
	booktitle = {2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain},
	title = {A case for toggle-aware compression for GPU systems.},
	url = {http://dx.doi.org/10.1109/HPCA.2016.7446064},
	year = {2016}
}
2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 2016
@inproceedings{hassan2016chargecache,
	author = {Hasan Hassan and Gennady Pekhimenko and Nandita Vijaykumar and Vivek Seshadri and Donghyuk Lee and Oguz Ergin and Onur Mutlu},
	booktitle = {2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain},
	title = {ChargeCache: Reducing DRAM latency by exploiting row access locality.},
	url = {http://dx.doi.org/10.1109/HPCA.2016.7446096},
	year = {2016}
}
CoRR, January 2016
Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive.

This work describes the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate "assist warps" that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency.

CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps.

We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.
@article{vijaykumar2016caba,
	abstract = {Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. 
This work describes the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate "assist warps" that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. 
CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. 
We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7\% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.},
	author = {Nandita Vijaykumar and Gennady Pekhimenko and Adwait Jog and Saugata Ghose and Abhishek Bhowmick and Rachata Ausavarungnirun and Chita R. Das and Mahmut T. Kandemir and Todd C. Mowry and Onur Mutlu},
	journal = {CoRR},
	title = {A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps.},
	url = {http://arxiv.org/abs/1602.01348},
	year = {2016}
}
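
To make the compression use case above concrete: CABA itself generates assist warps in hardware, but a rough software analogue of its compute-for-bandwidth trade can be sketched in CUDA. In the sketch below, the kernel reads 1-byte fixed-point inputs (a quarter of the traffic of raw 4-byte floats) and spends otherwise-idle ALU cycles expanding them before the main computation. The kernel name, the fixed-point encoding, and the launch parameters are illustrative assumptions, not taken from the paper.

// A minimal CUDA sketch, not CABA itself: trade idle ALU cycles for
// off-chip bandwidth by loading 1-byte fixed-point values instead of
// 4-byte floats and decompressing them in registers.
#include <cstdio>
#include <cuda_runtime.h>

// Each thread loads one byte (4x less memory traffic than a float),
// expands it with cheap ALU work, then runs the actual computation.
__global__ void scale_compressed(const unsigned char* in, float* out,
                                 float scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i] * scale;   // "decompression": idle ALUs, no extra bandwidth
        out[i] = x * x + 1.0f;     // the computation the kernel exists for
    }
}

int main() {
    const int n = 1 << 20;
    unsigned char* d_in;
    float* d_out;
    cudaMalloc(&d_in, n);
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 128, n);                      // dummy compressed input
    scale_compressed<<<(n + 255) / 256, 256>>>(d_in, d_out, 1.0f / 255.0f, n);
    cudaDeviceSynchronize();
    printf("kernel finished\n");
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

The point of the sketch is only the shape of the trade: when the memory pipeline is the bottleneck, extra ALU work per loaded byte is effectively free, which is the imbalance CABA exploits.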

2015

Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 2015
Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive.

This paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate “assist warps” that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency.

CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps.

We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.
@inproceedings{vijaykumar2015caba,
	abstract = {Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. This paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate {\textquotedblleft}assist warps{\textquotedblright} that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7\% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.},
	author = {Nandita Vijaykumar and Gennady Pekhimenko and Adwait Jog and Abhishek Bhowmick and Rachata Ausavarungnirun and Chita R. Das and Mahmut T. Kandemir and Todd C. Mowry and Onur Mutlu},
	booktitle = {Proceedings of the 42nd Annual International Symposium on Computer Architecture},
	title = {A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps.},
	url = {http://doi.acm.org/10.1145/2749469.2750399},
	address = {Portland, OR, USA},
	year = {2015}
}
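
The converse use case named in the abstract, memoization when computation is the bottleneck, can be sketched the same way. Below is a hedged software analogue, again not the paper's hardware mechanism: a cooperative phase precomputes an expensive function for all 256 possible 1-byte keys into a shared-memory table, and the main phase answers each element with a lookup instead of a recomputation. The function, table size, and kernel name are illustrative assumptions.

// A software analogue of assist-warp memoization, not the paper's
// hardware design: precompute an expensive function once per block,
// then answer every element with a table lookup.
#include <cuda_runtime.h>

#define TABLE_SIZE 256   // one entry per possible 1-byte key

__global__ void memoized_transform(const unsigned char* keys, float* out, int n) {
    __shared__ float table[TABLE_SIZE];

    // Cooperative "helper" phase: threads fill the memo table, playing
    // the role that hardware-generated assist warps play in CABA.
    for (int k = threadIdx.x; k < TABLE_SIZE; k += blockDim.x)
        table[k] = expf(sinf((float)k));   // stand-in for an expensive f(k)
    __syncthreads();

    // Main phase: each element is a lookup, not a recomputation of f.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table[keys[i]];
}

int main() {
    const int n = 1 << 20;
    unsigned char* d_keys;
    float* d_out;
    cudaMalloc(&d_keys, n);
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_keys, 42, n);                     // dummy keys
    memoized_transform<<<(n + 255) / 256, 256>>>(d_keys, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_keys);
    cudaFree(d_out);
    return 0;
}

In CABA proper, the table-filling work would be carried out by hardware-generated assist warps using otherwise-idle memory pipeline slots, rather than by the application kernel itself as in this sketch.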