Publications by Rachata Ausavarungnirun

2018

Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp’s aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.
@inproceedings{abc,
	abstract = {Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache.
In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp{\textquoteright}s aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8{\texttimes} larger capacity and improving overall GPU performance by 31\% while reducing register file power consumption by 46\%.},
	author = {Mohammad Sadrosadati and Amirhossein Mirhosseini and Seyed B. Ehsani and Hamid Sarbazi-Azad and Mario Drumond and Babak Falsafi and Rachata Ausavarungnirun and Onur Mutlu},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching},
	url = {https://dl.acm.org/citation.cfm?id=3173211},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}
Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
We are experiencing an explosive growth in the number of consumer devices, including smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. For this class of devices, energy efficiency is a first-class concern due to the limited battery capacity and thermal power budget. We find that data movement is a major contributor to the total system energy and execution time in consumer devices. The energy and performance costs of moving data between the memory system and the compute units are significantly higher than the costs of computation. As a result, addressing data movement is crucial for consumer devices. In this work, we comprehensively analyze the energy and performance impact of data movement for several widely-used Google consumer workloads: (1) the Chrome web browser; (2) TensorFlow Mobile, Google’s machine learning framework; (3) video playback, and (4) video capture, both of which are used in many video services such as YouTube and Google Hangouts. We find that processing-in-memory (PIM) can significantly reduce data movement for all of these workloads, by performing part of the computation close to memory. Each workload contains simple primitives and functions that contribute to a significant amount of the overall data movement. We investigate whether these primitives and functions are feasible to implement using PIM, given the limited area and power constraints of consumer devices. Our analysis shows that offloading these primitives to PIM logic, consisting of either simple cores or specialized accelerators, eliminates a large amount of data movement, and significantly reduces total system energy (by an average of 55.4% across the workloads) and execution time (by an average of 54.2%).
@inproceedings{abc,
	abstract = {We are experiencing an explosive growth in the number of consumer devices, including smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. For this class of devices, energy efficiency is a first-class concern due to the limited battery capacity and thermal power budget. We find that data movement is a major contributor to the total system energy and execution time in consumer devices. The energy and performance costs of moving data between the memory system and the compute units are significantly higher than the costs of computation. As a result, addressing data movement is crucial for consumer devices. In this work, we comprehensively analyze the energy and performance impact of data movement for several widely-used Google consumer workloads: (1) the Chrome web browser; (2) TensorFlow Mobile, Google{\textquoteright}s machine learning framework; (3) video playback, and (4) video capture, both of which are used in many video services such as YouTube and Google Hangouts. We find that processing-in-memory (PIM) can significantly reduce data movement for all of these workloads, by performing part of the computation close to memory. Each workload contains simple primitives and functions that contribute to a significant amount of the overall data movement. We investigate whether these primitives and functions are feasible to implement using PIM, given the limited area and power constraints of consumer devices. Our analysis shows that offloading these primitives to PIM logic, consisting of either simple cores or specialized accelerators, eliminates a large amount of data movement, and significantly reduces total system energy (by an average of 55.4\% across the workloads) and execution time (by an average of 54.2\%).},
	author = {Amirali Boroumand and Saugata Ghose and Youngsok Kim and Rachata Ausavarungnirun and Eric Shiu and Rahul Thakur and Dae-Hyun Kim and Aki Kuusela and Allan Knies and Parthasarathy Ranganathan and Onur Mutlu},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}
Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
Emerging chips with hundreds and thousands of cores require networks with unprecedented energy/area efficiency and scalability. To address this, we propose Slim NoC (SN): a new on-chip network design that delivers significant improvements in efficiency and scalability compared to the state-of-the-art. The key idea is to use two concepts from graph and number theory, degree-diameter graphs combined with non-prime finite fields, to enable the smallest number of ports for a given core count. SN is inspired by state-of-the-art off-chip topologies; it identifies and distills their advantages for NoC settings while solving several key issues that lead to significant overheads on-chip. SN provides NoC-specific layouts, which further enhance area/energy efficiency. We show how to augment SN with state-of-the-art router microarchitecture schemes such as Elastic Links, to make the network even more scalable and efficient. Our extensive experimental evaluations show that SN outperforms both traditional low-radix topologies (e.g., meshes and tori) and modern high-radix networks (e.g., various Flattened Butterflies) in area, latency, throughput, and static/dynamic power consumption for both synthetic and real workloads. SN provides a promising direction in scalable and energy-efficient NoC topologies.
@inproceedings{abc,
	abstract = {Emerging chips with hundreds and thousands of cores require networks with unprecedented energy/area efficiency and scalability. To address this, we propose Slim NoC (SN): a new on-chip network design that delivers significant improvements in efficiency and scalability compared to the state-of-the-art. The key idea is to use two concepts from graph and number theory, degree-diameter graphs combined with non-prime finite fields, to enable the smallest number of ports for a given core count. SN is inspired by state-of-the-art off-chip topologies; it identifies and distills their advantages for NoC settings while solving several key issues that lead to significant overheads on-chip. SN provides NoC-specific layouts, which further enhance area/energy efficiency. We show how to augment SN with state-of-the-art router microarchitecture schemes such as Elastic Links, to make the network even more scalable and efficient. Our extensive experimental evaluations show that SN outperforms both traditional low-radix topologies (e.g., meshes and tori) and modern high-radix networks (e.g., various Flattened Butterflies) in area, latency, throughput, and static/dynamic power consumption for both synthetic and real workloads. SN provides a promising direction in scalable and energy-efficient NoC topologies.},
	author = {Maciej Besta and Syed M. Hassan and Sudhakar Yalamanchili and Rachata Ausavarungnirun and Onur Mutlu and Torsten Hoefler},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {Slim NoC: A Low-Diameter On-Chip Network Topology for High Energy Efficiency and Scalability},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}
Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, have made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate. We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern.
Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to largely reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8%, improves IPC throughput by 43.4%, and reduces application-level unfairness by 22.4%. MASK’s system throughput is within 23.2% of an ideal GPU system with no address translation overhead.
@inproceedings{abc,
	abstract = {Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, have made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate. We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern.
Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to largely reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8\%, improves IPC throughput by 43.4\%, and reduces application-level unfairness by 22.4\%. MASK{\textquoteright}s system throughput is within 23.2\% of an ideal GPU system with no address translation overhead.},
	author = {Rachata Ausavarungnirun and Vance Miller and Joshua Landgraf and Saugata Ghose and Jayneel Gandhi and Adwait Jog and Christopher Rossbach and Onur Mutlu},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}

2017

Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, MA, USA, October 2017
Contemporary discrete GPUs support rich memory management features such as virtual memory and demand paging. These features simplify GPU programming by providing a virtual address space abstraction similar to CPUs and eliminating manual memory management, but they introduce high performance overheads during (1) address translation and (2) page faults. A GPU relies on high degrees of thread-level parallelism (TLP) to hide memory latency. Address translation can undermine TLP, as a single miss in the translation lookaside buffer (TLB) invokes an expensive serialized page table walk that often stalls multiple threads. Demand paging can also undermine TLP, as multiple threads often stall while they wait for an expensive data transfer over the system I/O (e.g., PCIe) bus when the GPU demands a page. In modern GPUs, we face a trade-off on how the page size used for memory management affects address translation and demand paging. The address translation overhead is lower when we employ a larger page size (e.g., 2MB large pages, compared with conventional 4KB base pages), which increases TLB coverage and thus reduces TLB misses. Conversely, the demand paging overhead is lower when we employ a smaller page size, which decreases the system I/O bus transfer latency. Support for multiple page sizes can help relax the page size trade-off so that address translation and demand paging optimizations work together synergistically. However, existing page coalescing (i.e., merging base pages into a large page) and splintering (i.e., splitting a large page into base pages) policies require costly base page migrations that undermine the benefits multiple page sizes provide. In this paper, we observe that GPGPU applications present an opportunity to support multiple page sizes without costly data migration, as the applications perform most of their memory allocation en masse (i.e., they allocate a large number of base pages at once). 
We show that this en masse allocation allows us to create intelligent memory allocation policies which ensure that base pages that are contiguous in virtual memory are allocated to contiguous physical memory pages. As a result, coalescing and splintering operations no longer need to migrate base pages. We introduce Mosaic, a GPU memory manager that provides application-transparent support for multiple page sizes. Mosaic uses base pages to transfer data over the system I/O bus, and allocates physical memory in a way that (1) preserves base page contiguity and (2) ensures that a large page frame contains pages from only a single memory protection domain. We take advantage of this allocation strategy to design a novel in-place page size selection mechanism that avoids data migration. This mechanism allows the TLB to use large pages, reducing address translation overhead. During data transfer, this mechanism enables the GPU to transfer only the base pages that are needed by the application over the system I/O bus, keeping demand paging overhead low. Our evaluations show that Mosaic reduces address translation overheads while efficiently achieving the benefits of demand paging, compared to a contemporary GPU that uses only a 4KB page size. Relative to a state-of-the-art GPU memory manager, Mosaic improves the performance of homogeneous and heterogeneous multi-application workloads by 55.5% and 29.7% on average, respectively, coming within 6.8% and 15.4% of the performance of an ideal TLB where all TLB requests are hits.
@inproceedings{abc,
	abstract = {Contemporary discrete GPUs support rich memory management features such as virtual memory and demand paging. These features simplify GPU programming by providing a virtual address space abstraction similar to CPUs and eliminating manual memory management, but they introduce high performance overheads during (1) address translation and (2) page faults. A GPU relies on high degrees of thread-level parallelism (TLP) to hide memory latency. Address translation can undermine TLP, as a single miss in the translation lookaside buffer (TLB) invokes an expensive serialized page table walk that often stalls multiple threads. Demand paging can also undermine TLP, as multiple threads often stall while they wait for an expensive data transfer over the system I/O (e.g., PCIe) bus when the GPU demands a page.

In modern GPUs, we face a trade-off on how the page size used for memory management affects address translation and demand paging. The address translation overhead is lower when we employ a larger page size (e.g., 2MB large pages, compared with conventional 4KB base pages), which increases TLB coverage and thus reduces TLB misses. Conversely, the demand paging overhead is lower when we employ a smaller page size, which decreases the system I/O bus transfer latency. Support for multiple page sizes can help relax the page size trade-off so that address translation and demand paging optimizations work together synergistically. However, existing page coalescing (i.e., merging base pages into a large page) and splintering (i.e., splitting a large page into base pages) policies require costly base page migrations that undermine the benefits multiple page sizes provide. In this paper, we observe that GPGPU applications present an opportunity to support multiple page sizes without costly data migration, as the applications perform most of their memory allocation en masse (i.e., they allocate a large number of base pages at once). We show that this en masse allocation allows us to create intelligent memory allocation policies which ensure that base pages that are contiguous in virtual memory are allocated to contiguous physical memory pages. As a result, coalescing and splintering operations no longer need to migrate base pages.

We introduce Mosaic, a GPU memory manager that provides application-transparent support for multiple page sizes. Mosaic uses base pages to transfer data over the system I/O bus, and allocates physical memory in a way that (1) preserves base page contiguity and (2) ensures that a large page frame contains pages from only a single memory protection domain. We take advantage of this allocation strategy to design a novel in-place page size selection mechanism that avoids data migration. This mechanism allows the TLB to use large pages, reducing address translation overhead. During data transfer, this mechanism enables the GPU to transfer only the base pages that are needed by the application over the system I/O bus, keeping demand paging overhead low. Our evaluations show that Mosaic reduces address translation overheads while efficiently achieving the benefits of demand paging, compared to a contemporary GPU that uses only a 4KB page size. Relative to a state-of-the-art GPU memory manager, Mosaic improves the performance of homogeneous and heterogeneous multi-application workloads by 55.5\% and 29.7\% on average, respectively, coming within 6.8\% and 15.4\% of the performance of an ideal TLB where all TLB requests are hits.},
	author = {Rachata Ausavarungnirun and Joshua Landgraf and Vance Miller and Saugata Ghose and Jayneel Gandhi and Christopher Rossbach and others},
	booktitle = {Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture},
	title = {Mosaic: a GPU memory manager with application-transparent support for multiple page sizes},
	venue = {Cambridge, MA, USA},
	year = {2017}
}
Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, Urbana-Champaign, IL, USA, June 2017
@inproceedings{abc,
	author = {Donghyuk Lee and Samira Manabi Khan and Lavanya Subramanian and Saugata Ghose and Rachata Ausavarungnirun and Gennady Pekhimenko and Vivek Seshadri and Onur Mutlu},
	booktitle = {Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, Urbana-Champaign, IL, USA},
	title = {Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms},
	url = {http://doi.acm.org/10.1145/3078505.3078533},
	year = {2017}
}
POMACS, January 2017
@article{abc,
	author = {Donghyuk Lee and Samira Manabi Khan and Lavanya Subramanian and Saugata Ghose and Rachata Ausavarungnirun and Gennady Pekhimenko and Vivek Seshadri and Onur Mutlu},
	journal = {POMACS},
	title = {Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms},
	url = {http://doi.acm.org/10.1145/3084464},
	year = {2017}
}

2016

Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT 2016, Haifa, Israel, September 2016
@inproceedings{abc,
	author = {Onur Kayiran and Adwait Jog and Ashutosh Pattnaik and Rachata Ausavarungnirun and Xulong Tang and Mahmut T. Kandemir and Gabriel H. Loh and Onur Mutlu and Chita R. Das},
	booktitle = {Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT 2016, Haifa, Israel},
	title = {{$\mu$}C-States: Fine-grained GPU Datapath Power Management},
	url = {http://doi.acm.org/10.1145/2967938.2967941},
	year = {2016}
}
2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 2016
@inproceedings{abc,
	author = {Yang Li and Di Wang and Saugata Ghose and Jie Liu and Sriram Govindan and Sean James and Eric Peterson and John Siegler and Rachata Ausavarungnirun and Onur Mutlu},
	booktitle = {2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain},
	title = {SizeCap: Efficiently handling power surges in fuel cell powered data centers},
	url = {http://dx.doi.org/10.1109/HPCA.2016.7446085},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Rachata Ausavarungnirun and Chris Fallin and Xiangyao Yu and Kevin Kai-Wei Chang and Greg Nazario and Reetuparna Das and Gabriel H. Loh and Onur Mutlu},
	journal = {CoRR},
	title = {Achieving both High Energy Efficiency and High Performance in On-Chip Communication using Hierarchical Rings with Deflection Routing},
	url = {http://arxiv.org/abs/1602.06005},
	year = {2016}
}
CoRR, January 2016
Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. This work describes the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate "assist warps" that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.
@article{abc,
	abstract = {Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. 
This work describes the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate "assist warps" that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. 
CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. 
We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7\% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.},
	author = {Nandita Vijaykumar and Gennady Pekhimenko and Adwait Jog and Saugata Ghose and Abhishek Bhowmick and Rachata Ausavarungnirun and Chita R. Das and Mahmut T. Kandemir and Todd C. Mowry and Onur Mutlu},
	journal = {CoRR},
	title = {A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps.},
	url = {http://arxiv.org/abs/1602.01348},
	year = {2016}
}
Parallel Computing, January 2016
@article{abc,
	author = {Rachata Ausavarungnirun and Chris Fallin and Xiangyao Yu and Kevin Kai-Wei Chang and Greg Nazario and Reetuparna Das and Gabriel H. Loh and Onur Mutlu},
	journal = {Parallel Computing},
	title = {A case for hierarchical rings with deflection routing: An energy-efficient on-chip communication substrate.},
	url = {http://dx.doi.org/10.1016/j.parco.2016.01.009},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Donghyuk Lee and Samira Manabi Khan and Lavanya Subramanian and Rachata Ausavarungnirun and Gennady Pekhimenko and Vivek Seshadri and Saugata Ghose and Onur Mutlu},
	journal = {CoRR},
	title = {Reducing DRAM Latency by Exploiting Design-Induced Latency Variation in Modern DRAM Chips.},
	url = {http://arxiv.org/abs/1610.09604},
	year = {2016}
}

2015

2015 International Conference on Parallel Architecture and Compilation, PACT 2015, San Francisco, CA, USA, October 2015
@inproceedings{abc,
	author = {Donghyuk Lee and Lavanya Subramanian and Rachata Ausavarungnirun and Jongmoo Choi and Onur Mutlu},
	booktitle = {2015 International Conference on Parallel Architecture and Compilation, PACT 2015, San Francisco, CA, USA},
	title = {Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM.},
	url = {http://dx.doi.org/10.1109/PACT.2015.51},
	year = {2015}
}
2015 International Conference on Parallel Architecture and Compilation, PACT 2015, San Francisco, CA, USA, October 2015
@inproceedings{abc,
	author = {Rachata Ausavarungnirun and Saugata Ghose and Onur Kayiran and Gabriel H. Loh and Chita R. Das and Mahmut T. Kandemir and Onur Mutlu},
	booktitle = {2015 International Conference on Parallel Architecture and Compilation, PACT 2015, San Francisco, CA, USA},
	title = {Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance.},
	url = {http://dx.doi.org/10.1109/PACT.2015.38},
	year = {2015}
}
Proceedings of the 9th International Symposium on Networks-on-Chip, NOCS 2015, Vancouver, BC, Canada, September 2015
@inproceedings{abc,
	author = {Mohammad Fattah and Antti Airola and Rachata Ausavarungnirun and Nima Mirzaei and Pasi Liljeberg and Juha Plosila and Siamak Mohammadi and Tapio Pahikkala and Onur Mutlu and Hannu Tenhunen},
	booktitle = {Proceedings of the 9th International Symposium on Networks-on-Chip, NOCS 2015, Vancouver, BC, Canada},
	title = {A Low-Overhead, Fully-Distributed, Guaranteed-Delivery Routing Algorithm for Faulty Network-on-Chips.},
	url = {http://doi.acm.org/10.1145/2786572.2786591},
	year = {2015}
}
Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 2015
Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. This paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate “assist warps” that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.
@inproceedings{abc,
	abstract = {Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. This paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate {\textquotedblleft}assist warps{\textquotedblright} that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7\% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.},
	author = {Nandita Vijaykumar and Gennady Pekhimenko and Adwait Jog and Abhishek Bhowmick and Rachata Ausavarungnirun and Chita R. Das and Mahmut T. Kandemir and Todd C. Mowry and Onur Mutlu},
	booktitle = {Proceedings of the 42nd Annual International Symposium on Computer Architecture},
	title = {A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps.},
	url = {http://doi.acm.org/10.1145/2749469.2750399},
	venue = {Portland, OR, USA},
	year = {2015}
}

2014

47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom, December 2014
@inproceedings{abc,
	author = {Onur Kayiran and Nachiappan Chidambaram Nachiappan and Adwait Jog and Rachata Ausavarungnirun and Mahmut T. Kandemir and Gabriel H. Loh and Onur Mutlu and Chita R. Das},
	booktitle = {47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom},
	title = {Managing GPU Concurrency in Heterogeneous Architectures.},
	url = {http://dx.doi.org/10.1109/MICRO.2014.62},
	year = {2014}
}
26th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2014, Paris, France, October 2014
@inproceedings{abc,
	author = {Rachata Ausavarungnirun and Chris Fallin and Xiangyao Yu and Kevin Kai-Wei Chang and Greg Nazario and Reetuparna Das and Gabriel H. Loh and Onur Mutlu},
	booktitle = {26th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2014, Paris, France},
	title = {Design and Evaluation of Hierarchical Rings with Deflection Routing.},
	url = {http://dx.doi.org/10.1109/SBAC-PAD.2014.31},
	year = {2014}
}

2013

The 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, Davis, CA, USA, December 2013
@inproceedings{abc,
	author = {Vivek Seshadri and Yoongu Kim and Chris Fallin and Donghyuk Lee and Rachata Ausavarungnirun and Gennady Pekhimenko and Yixin Luo and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {The 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, Davis, CA, USA},
	title = {RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization.},
	url = {http://doi.acm.org/10.1145/2540708.2540725},
	year = {2013}
}
19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China, February 2013
@inproceedings{abc,
	author = {Reetuparna Das and Rachata Ausavarungnirun and Onur Mutlu and Akhilesh Kumar and Mani Azimi},
	booktitle = {19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China},
	title = {Application-to-core mapping policies to reduce memory system interference in multi-core systems.},
	url = {http://dx.doi.org/10.1109/HPCA.2013.6522311},
	year = {2013}
}

2012

IEEE 24th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2012, New York, NY, USA, October 2012
@inproceedings{abc,
	author = {Kevin Kai-Wei Chang and Rachata Ausavarungnirun and Chris Fallin and Onur Mutlu},
	booktitle = {IEEE 24th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2012, New York, NY, USA},
	title = {HAT: Heterogeneous Adaptive Throttling for On-Chip Networks.},
	url = {http://doi.ieeecomputersociety.org/10.1109/SBAC-PAD.2012.44},
	year = {2012}
}
International Conference on Parallel Architectures and Compilation Techniques, PACT '12, Minneapolis, MN, September 2012
@inproceedings{abc,
	author = {Reetuparna Das and Rachata Ausavarungnirun and Onur Mutlu and Akhilesh Kumar and Mani Azimi},
	booktitle = {International Conference on Parallel Architectures and Compilation Techniques, PACT {\textquoteright}12, Minneapolis, MN},
	title = {Application-to-core mapping policies to reduce memory interference in multi-core systems.},
	url = {http://doi.acm.org/10.1145/2370816.2370893},
	year = {2012}
}
30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada, September 2012
@inproceedings{abc,
	author = {HanBin Yoon and Justin Meza and Rachata Ausavarungnirun and Rachael Harding and Onur Mutlu},
	booktitle = {30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada},
	title = {Row buffer locality aware caching policies for hybrid memories.},
	url = {http://dx.doi.org/10.1109/ICCD.2012.6378661},
	year = {2012}
}
39th International Symposium on Computer Architecture (ISCA 2012), Portland, OR, USA, June 2012
@inproceedings{abc,
	author = {Rachata Ausavarungnirun and Kevin Kai-Wei Chang and Lavanya Subramanian and Gabriel H. Loh and Onur Mutlu},
	booktitle = {39th International Symposium on Computer Architecture (ISCA 2012)},
	title = {Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems.},
	url = {http://dx.doi.org/10.1109/ISCA.2012.6237036},
	venue = {Portland, OR, USA},
	year = {2012}
}
2012 Sixth IEEE/ACM International Symposium on Networks-on-Chip (NoCS), Copenhagen, Denmark, May 2012
@inproceedings{abc,
	author = {Chris Fallin and Greg Nazario and Xiangyao Yu and Kevin Kai-Wei Chang and Rachata Ausavarungnirun and Onur Mutlu},
	booktitle = {2012 Sixth IEEE/ACM International Symposium on Networks-on-Chip (NoCS), Copenhagen, Denmark},
	title = {MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect.},
	url = {http://dx.doi.org/10.1109/NOCS.2012.8},
	year = {2012}
}