Publications by Onur Mutlu

2018

Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
We are experiencing an explosive growth in the number of consumer devices, including smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. For this class of devices, energy efficiency is a first-class concern due to the limited battery capacity and thermal power budget. We find that data movement is a major contributor to the total system energy and execution time in consumer devices. The energy and performance costs of moving data between the memory system and the compute units are significantly higher than the costs of computation. As a result, addressing data movement is crucial for consumer devices. In this work, we comprehensively analyze the energy and performance impact of data movement for several widely-used Google consumer workloads: (1) the Chrome web browser; (2) TensorFlow Mobile, Google’s machine learning framework; (3) video playback, and (4) video capture, both of which are used in many video services such as YouTube and Google Hangouts. We find that processing-in-memory (PIM) can significantly reduce data movement for all of these workloads, by performing part of the computation close to memory. Each workload contains simple primitives and functions that contribute to a significant amount of the overall data movement. We investigate whether these primitives and functions are feasible to implement using PIM, given the limited area and power constraints of consumer devices. Our analysis shows that offloading these primitives to PIM logic, consisting of either simple cores or specialized accelerators, eliminates a large amount of data movement, and significantly reduces total system energy (by an average of 55.4% across the workloads) and execution time (by an average of 54.2%).
@inproceedings{boroumand2018google,
	abstract = {We are experiencing an explosive growth in the number of consumer devices, including smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. For this class of devices, energy efficiency is a first-class concern due to the limited battery capacity and thermal power budget. We find that data movement is a major contributor to the total system energy and execution time in consumer devices. The energy and performance costs of moving data between the memory system and the compute units are significantly higher than the costs of computation. As a result, addressing data movement is crucial for consumer devices. In this work, we comprehensively analyze the energy and performance impact of data movement for several widely-used Google consumer workloads: (1) the Chrome web browser; (2) TensorFlow Mobile, Google{\textquoteright}s machine learning framework; (3) video playback, and (4) video capture, both of which are used in many video services such as YouTube and Google Hangouts. We find that processing-in-memory (PIM) can significantly reduce data movement for all of these workloads, by performing part of the computation close to memory. Each workload contains simple primitives and functions that contribute to a significant amount of the overall data movement. We investigate whether these primitives and functions are feasible to implement using PIM, given the limited area and power constraints of consumer devices. Our analysis shows that offloading these primitives to PIM logic, consisting of either simple cores or specialized accelerators, eliminates a large amount of data movement, and significantly reduces total system energy (by an average of 55.4\% across the workloads) and execution time (by an average of 54.2\%).},
	author = {Amirali Boroumand and Saugata Ghose and Youngsok Kim and Rachata Ausavarungnirun and Eric Shiu and Rahul Thakur and Dae-Hyun Kim and Aki Kuusela and Allan Knies and Parthasarathy Ranganathan and Onur Mutlu},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}
Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
Emerging chips with hundreds and thousands of cores require networks with unprecedented energy/area efficiency and scalability. To address this, we propose Slim NoC (SN): a new on-chip network design that delivers significant improvements in efficiency and scalability compared to the state-of-the-art. The key idea is to use two concepts from graph and number theory, degree-diameter graphs combined with non-prime finite fields, to enable the smallest number of ports for a given core count. SN is inspired by state-of-the-art off-chip topologies; it identifies and distills their advantages for NoC settings while solving several key issues that lead to significant overheads on-chip. SN provides NoC-specific layouts, which further enhance area/energy efficiency. We show how to augment SN with state-of-the-art router microarchitecture schemes such as Elastic Links, to make the network even more scalable and efficient. Our extensive experimental evaluations show that SN outperforms both traditional low-radix topologies (e.g., meshes and tori) and modern high-radix networks (e.g., various Flattened Butterflies) in area, latency, throughput, and static/dynamic power consumption for both synthetic and real workloads. SN provides a promising direction in scalable and energy-efficient NoC topologies.
@inproceedings{besta2018slim,
	abstract = {Emerging chips with hundreds and thousands of cores require networks with unprecedented energy/area efficiency and scalability. To address this, we propose Slim NoC (SN): a new on-chip network design that delivers significant improvements in efficiency and scalability compared to the state-of-the-art. The key idea is to use two concepts from graph and number theory, degree-diameter graphs combined with non-prime finite fields, to enable the smallest number of ports for a given core count. SN is inspired by state-of-the-art off-chip topologies; it identifies and distills their advantages for NoC settings while solving several key issues that lead to significant overheads on-chip. SN provides NoC-specific layouts, which further enhance area/energy efficiency. We show how to augment SN with state-of-the-art router microarchitecture schemes such as Elastic Links, to make the network even more scalable and efficient. Our extensive experimental evaluations show that SN outperforms both traditional low-radix topologies (e.g., meshes and tori) and modern high-radix networks (e.g., various Flattened Butterflies) in area, latency, throughput, and static/dynamic power consumption for both synthetic and real workloads. SN provides a promising direction in scalable and energy-efficient NoC topologies.},
	author = {Maciej Besta and Syed M. Hassan and Sudhakar Yalamanchili and Rachata Ausavarungnirun and Onur Mutlu and Torsten Hoefler},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {Slim NoC: A Low-Diameter On-Chip Network Topology for High Energy Efficiency and Scalability},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}
Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate. We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern.
Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to largely reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8%, improves IPC throughput by 43.4%, and reduces application-level unfairness by 22.4%. MASK’s system throughput is within 23.2% of an ideal GPU system with no address translation overhead.
@inproceedings{ausavarungnirun2018mask,
	abstract = {Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate. We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern.
Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to largely reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8\%, improves IPC throughput by 43.4\%, and reduces application-level unfairness by 22.4\%. MASK{\textquoteright}s system throughput is within 23.2\% of an ideal GPU system with no address translation overhead.},
	author = {Rachata Ausavarungnirun and Vance Miller and Joshua Landgraf and Saugata Ghose and Jayneel Gandhi and Adwait Jog and Christopher Rossbach and Onur Mutlu},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}
Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp’s aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.
@inproceedings{sadrosadati2018ltrf,
	abstract = {Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache.
In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp{\textquoteright}s aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8{\texttimes} larger capacity and improving overall GPU performance by 31\% while reducing register file power consumption by 46\%.},
	author = {Mohammad Sadrosadati and Amirhossein Mirhosseini and Seyed B. Ehsani and Hamid Sarbazi-Azad and Mario Drumond and Babak Falsafi and Rachata Ausavarungnirun and Onur Mutlu},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching},
	url = {https://dl.acm.org/citation.cfm?id=3173211},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}
Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
Resource management strategies for many-core systems need to enable sharing of resources such as power, processing cores, and memory bandwidth while coordinating the priority and significance of system- and application-level objectives at runtime in a scalable and robust manner. State-of-the-art approaches use heuristics or machine learning for resource management, but unfortunately lack formalism in providing robustness against unexpected corner cases. While recent efforts deploy classical control-theoretic approaches with some guarantees and formalism, they lack scalability and autonomy to meet changing runtime goals. We present SPECTR, a new resource management approach for many-core systems that leverages formal supervisory control theory (SCT) to combine the strengths of classical control theory with state-of-the-art heuristic approaches to efficiently meet changing runtime goals. SPECTR is a scalable and robust control architecture and a systematic design flow for hierarchical control of many-core systems. SPECTR leverages SCT techniques such as gain scheduling to allow autonomy for individual controllers. It facilitates automatic synthesis of the high-level supervisory controller and its property verification. We implement SPECTR on an Exynos platform containing ARM’s big.LITTLE-based heterogeneous multi-processor (HMP) and demonstrate that SPECTR’s use of SCT is key to managing multiple interacting resources (e.g., chip power and processing cores) in the presence of competing objectives (e.g., satisfying QoS vs. power capping). The principles of SPECTR are easily applicable to any resource type and objective as long as the management problem can be modeled using dynamical systems theory (e.g., difference equations), discrete-event dynamic systems, or fuzzy dynamics.
@inproceedings{rahmani2018spectr,
	abstract = {Resource management strategies for many-core systems need to enable sharing of resources such as power, processing cores, and memory bandwidth while coordinating the priority and significance of system- and application-level objectives at runtime in a scalable and robust manner. State-of-the-art approaches use heuristics or machine learning for resource management, but unfortunately lack formalism in providing robustness against unexpected corner cases. While recent efforts deploy classical control-theoretic approaches with some guarantees and formalism, they lack scalability and autonomy to meet changing runtime goals. We present SPECTR, a new resource management approach for many-core systems that leverages formal supervisory control theory (SCT) to combine the strengths of classical control theory with state-of-the-art heuristic approaches to efficiently meet changing runtime goals. SPECTR is a scalable and robust control architecture and a systematic design flow for hierarchical control of many-core systems. SPECTR leverages SCT techniques such as gain scheduling to allow autonomy for individual controllers. It facilitates automatic synthesis of the high-level supervisory controller and its property verification. We implement SPECTR on an Exynos platform containing ARM{\textquoteright}s big.LITTLE-based heterogeneous multi-processor (HMP) and demonstrate that SPECTR{\textquoteright}s use of SCT is key to managing multiple interacting resources (e.g., chip power and processing cores) in the presence of competing objectives (e.g., satisfying QoS vs. power capping). The principles of SPECTR are easily applicable to any resource type and objective as long as the management problem can be modeled using dynamical systems theory (e.g., difference equations), discrete-event dynamic systems, or fuzzy dynamics.},
	author = {Amir M. Rahmani and Bryan Donyanavard and Tiago Mück and Kasra Moazzemi and Axel Jantsch and Onur Mutlu and Nikil Dutt},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {SPECTR: Formal Supervisory Control and Coordination for Many-core Systems Resource Management},
	url = {https://dl.acm.org/citation.cfm?id=3173199},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}
Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA), Vienna, Austria, February 2018
NAND flash memory density continues to scale to keep up with the increasing storage demands of data-intensive applications. Unfortunately, as a result of this scaling, the lifetime of NAND flash memory has been decreasing. Each cell in NAND flash memory can endure only a limited number of writes, due to the damage caused by each program and erase operation on the cell. This damage can be partially repaired on its own during the idle time between program or erase operations (known as the dwell time), via a phenomenon known as the self-recovery effect. Prior works study the self-recovery effect for planar (i.e., 2D) NAND flash memory, and propose to exploit it to improve flash lifetime, by applying high temperature to accelerate self-recovery. However, these findings may not be directly applicable to 3D NAND flash memory, due to significant changes in the design and manufacturing process that are required to enable practical 3D stacking for NAND flash memory. In this paper, we perform the first detailed experimental characterization of the effects of self-recovery and temperature on real, state-of-the-art 3D NAND flash memory devices. We show that these effects influence two major factors of NAND flash memory reliability: (1) retention loss speed (i.e., the speed at which a flash cell leaks charge), and (2) program variation (i.e., the difference in programming speed across flash cells). We find that self-recovery and temperature affect 3D NAND flash memory quite differently than they affect planar NAND flash memory, rendering prior models of self-recovery and temperature ineffective for 3D NAND flash memory. Using our characterization results, we develop a new model for 3D NAND flash memory reliability, which predicts how retention, wearout, self-recovery, and temperature affect raw bit error rates and cell threshold voltages. We show that our model is accurate, with an error of only 4.9%.
Based on our experimental findings and our model, we propose HeatWatch, a new mechanism to improve 3D NAND flash memory reliability. The key idea of HeatWatch is to optimize the read reference voltage, i.e., the voltage applied to the cell during a read operation, by adapting it to the dwell time of the workload and the current operating temperature. HeatWatch (1) efficiently tracks flash memory temperature and dwell time online, (2) sends this information to our reliability model to predict the current voltages of flash cells, and (3) predicts the optimal read reference voltage based on the current cell voltages. Our detailed experimental evaluations show that HeatWatch improves flash lifetime by 3.85× over a baseline that uses a fixed read reference voltage, averaged across 28 real storage workload traces, and comes within 0.9% of the lifetime of an ideal read reference voltage selection mechanism.
@inproceedings{luo2018heatwatch,
	abstract = {NAND flash memory density continues to scale to keep up with the increasing storage demands of data-intensive applications. Unfortunately, as a result of this scaling, the lifetime of NAND flash memory has been decreasing. Each cell in NAND flash memory can endure only a limited number of writes, due to the damage caused by each program and erase operation on the cell. This damage can be partially repaired on its own during the idle time between program or erase operations (known as the dwell time), via a phenomenon known as the self-recovery effect. Prior works study the self-recovery effect for planar (i.e., 2D) NAND flash memory, and propose to exploit it to improve flash lifetime, by applying high temperature to accelerate self-recovery. However, these findings may not be directly applicable to 3D NAND flash memory, due to significant changes in the design and manufacturing process that are required to enable practical 3D stacking for NAND flash memory. In this paper, we perform the first detailed experimental characterization of the effects of self-recovery and temperature on real, state-of-the-art 3D NAND flash memory devices. We show that these effects influence two major factors of NAND flash memory reliability: (1) retention loss speed (i.e., the speed at which a flash cell leaks charge), and (2) program variation (i.e., the difference in programming speed across flash cells). We find that self-recovery and temperature affect 3D NAND flash memory quite differently than they affect planar NAND flash memory, rendering prior models of self-recovery and temperature ineffective for 3D NAND flash memory. Using our characterization results, we develop a new model for 3D NAND flash memory reliability, which predicts how retention, wearout, self-recovery, and temperature affect raw bit error rates and cell threshold voltages. We show that our model is accurate, with an error of only 4.9\%.
Based on our experimental findings and our model, we propose HeatWatch, a new mechanism to improve 3D NAND flash memory reliability. The key idea of HeatWatch is to optimize the read reference voltage, i.e., the voltage applied to the cell during a read operation, by adapting it to the dwell time of the workload and the current operating temperature. HeatWatch (1) efficiently tracks flash memory temperature and dwell time online, (2) sends this information to our reliability model to predict the current voltages of flash cells, and (3) predicts the optimal read reference voltage based on the current cell voltages. Our detailed experimental evaluations show that HeatWatch improves flash lifetime by 3.85{\texttimes} over a baseline that uses a fixed read reference voltage, averaged across 28 real storage workload traces, and comes within 0.9\% of the lifetime of an ideal read reference voltage selection mechanism.},
	author = {Yixin Luo and Saugata Ghose and Yu Cai and Erich F. Haratsch and Onur Mutlu},
	booktitle = {Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA)},
	title = {HeatWatch: Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature-Awareness},
	venue = {Vienna, Austria},
	year = {2018}
}
Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA), Vienna, Austria, February 2018
Physically Unclonable Functions (PUFs) are commonly used in cryptography to identify devices based on the uniqueness of their physical microstructures. DRAM-based PUFs have numerous advantages over PUF designs that exploit alternative substrates: DRAM is a major component of many modern systems, and a DRAM-based PUF can generate many unique identifiers. However, none of the prior DRAM PUF proposals provide implementations suitable for runtime-accessible PUF evaluation on commodity DRAM devices. Prior DRAM PUFs exhibit unacceptably high latencies, especially at low temperatures (e.g., >125.8s on average for a 64KiB memory segment below 55°C), and they cause high system interference by keeping part of DRAM unavailable during PUF evaluation. In this paper, we introduce the DRAM latency PUF, a new class of fast, reliable DRAM PUFs. The key idea is to reduce DRAM read access latency below the reliable datasheet specifications using software-only system calls. Doing so results in error patterns that reflect the compound effects of manufacturing variations in various DRAM structures (e.g., capacitors, wires, sense amplifiers). Based on a rigorous experimental characterization of 223 modern LPDDR4 DRAM chips, we demonstrate that these error patterns 1) satisfy runtime-accessible PUF requirements, and 2) are quickly generated (i.e., at 88.2ms) irrespective of operating temperature using a real system with no additional hardware modifications. We show that, for a constant DRAM capacity overhead of 64KiB, our implementation of the DRAM latency PUF enables an average (minimum, maximum) PUF evaluation time speedup of 152x (109x, 181x) at 70°C and 1426x (868x, 1783x) at 55°C when compared to a DRAM retention PUF and achieves greater speedups at even lower temperatures.
@inproceedings{kim2018dram,
	abstract = {Physically Unclonable Functions (PUFs) are commonly used in cryptography to identify devices based on the uniqueness of their physical microstructures. DRAM-based PUFs have numerous advantages over PUF designs that exploit alternative substrates: DRAM is a major component of many modern systems, and a DRAM-based PUF can generate many unique identifiers. However, none of the prior DRAM PUF proposals provide implementations suitable for runtime-accessible PUF evaluation on commodity DRAM devices. Prior DRAM PUFs exhibit unacceptably high latencies, especially at low temperatures (e.g., >125.8s on average for a 64KiB memory segment below 55{\textdegree}C), and they cause high system interference by keeping part of DRAM unavailable during PUF evaluation. In this paper, we introduce the DRAM latency PUF, a new class of fast, reliable DRAM PUFs. The key idea is to reduce DRAM read access latency below the reliable datasheet specifications using software-only system calls. Doing so results in error patterns that reflect the compound effects of manufacturing variations in various DRAM structures (e.g., capacitors, wires, sense amplifiers). Based on a rigorous experimental characterization of 223 modern LPDDR4 DRAM chips, we demonstrate that these error patterns 1) satisfy runtime-accessible PUF requirements, and 2) are quickly generated (i.e., at 88.2ms) irrespective of operating temperature using a real system with no additional hardware modifications. We show that, for a constant DRAM capacity overhead of 64KiB, our implementation of the DRAM latency PUF enables an average (minimum, maximum) PUF evaluation time speedup of 152x (109x, 181x) at 70{\textdegree}C and 1426x (868x, 1783x) at 55{\textdegree}C when compared to a DRAM retention PUF and achieves greater speedups at even lower temperatures.},
	author = {Jeremie Kim and Minesh Patel and Hasan Hassan and Onur Mutlu},
	booktitle = {Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA)},
	title = {The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Mod...},
	venue = {Vienna, Austria},
	year = {2018}
}
Proceedings of the 16th USENIX Conference on File and Storage Technologies, Oakland, CA, USA, February 2018
Solid-state drives (SSDs) are used in a wide array of computer systems today, including in datacenters and enterprise servers. As the I/O demands of these systems continue to increase, manufacturers are evolving SSD architectures to keep up with this demand. For example, manufacturers have introduced new high-bandwidth interfaces to replace the conventional SATA host-interface protocol. These new interfaces, such as the NVMe protocol, are designed specifically to enable the high amounts of concurrent I/O bandwidth that SSDs are capable of delivering. While modern SSDs with sophisticated features such as the NVMe protocol are already on the market, existing SSD simulation tools have fallen behind, as they do not capture these new features. We find that state-of-the-art SSD simulators have three shortcomings that prevent them from accurately modeling the performance of real off-the-shelf SSDs. First, these simulators do not model critical features of new protocols (e.g., NVMe), such as their use of multiple application-level queues for requests and the elimination of OS intervention for I/O request processing. Second, these simulators often do not accurately capture the impact of advanced SSD maintenance algorithms (e.g., garbage collection), as they do not properly or quickly emulate steady-state conditions that can significantly change the behavior of these algorithms in real SSDs. Third, these simulators do not capture the full end-to-end latency of I/O requests, which can incorrectly skew the results reported for SSDs that make use of emerging non-volatile memory technologies. By not accurately modeling these three features, existing simulators report results that deviate significantly from real SSD performance. In this work, we introduce a new simulator, called MQSim, that accurately models the performance of both modern SSDs and conventional SATA-based SSDs. 
MQSim faithfully models new high-bandwidth protocol implementations, steady-state SSD conditions, and the full end-to-end latency of requests in modern SSDs. We validate MQSim, showing that it reports performance results that are only 6%-18% apart from the measured actual performance of four real state-of-the-art SSDs. We show that by modeling critical features of modern SSDs, MQSim uncovers several real and important issues that were not captured by existing simulators, such as the performance impact of inter-flow interference. We have released MQSim as an open-source tool, and we hope that it can enable researchers to explore directions in new and different areas.
@inproceedings{abc,
	abstract = {Solid-state drives (SSDs) are used in a wide array of computer systems today, including in datacenters and enterprise servers. As the I/O demands of these systems continue to increase, manufacturers are evolving SSD architectures to keep up with this demand. For example, manufacturers have introduced new high-bandwidth interfaces to replace the conventional SATA host-interface protocol. These new interfaces, such as the NVMe protocol, are designed specifically to enable the high amounts of concurrent I/O bandwidth that SSDs are capable of delivering.

While modern SSDs with sophisticated features such as the NVMe protocol are already on the market, existing SSD simulation tools have fallen behind, as they do not capture these new features. We find that state-of-the-art SSD simulators have three shortcomings that prevent them from accurately modeling the performance of real off-the-shelf SSDs. First, these simulators do not model critical features of new protocols (e.g., NVMe), such as their use of multiple application-level queues for requests and the elimination of OS intervention for I/O request processing. Second, these simulators often do not accurately capture the impact of advanced SSD maintenance algorithms (e.g., garbage collection), as they do not properly or quickly emulate steady-state conditions that can significantly change the behavior of these algorithms in real SSDs. Third, these simulators do not capture the full end-to-end latency of I/O requests, which can incorrectly skew the results reported for SSDs that make use of emerging non-volatile memory technologies. By not accurately modeling these three features, existing simulators report results that deviate significantly from real SSD performance.

In this work, we introduce a new simulator, called MQSim, that accurately models the performance of both modern SSDs and conventional SATA-based SSDs. MQSim faithfully models new high-bandwidth protocol implementations, steady-state SSD conditions, and the full end-to-end latency of requests in modern SSDs. We validate MQSim, showing that it reports performance results that are only 6\%-18\% apart from the measured actual performance of four real state-of-the-art SSDs. We show that by modeling critical features of modern SSDs, MQSim uncovers several real and important issues that were not captured by existing simulators, such as the performance impact of inter-flow interference. We have released MQSim as an open-source tool, and we hope that it can enable researchers to explore directions in new and different areas.},
	author = {Arash Tavakkol and Juan Gomez-Luna and Mohammad Sadrosadati and Saugata Ghose and Onur Mutlu},
	booktitle = {Proceedings of the 16th USENIX Conference on File and Storage Technologies},
	title = {MQSim: a framework for enabling realistic studies of modern multi-queue SSD devices},
	venue = {Oakland, CA, USA},
	year = {2018}
}

2017

Bioinformatics, November 2017
Motivation: High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments (called short reads) that cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and ‘candidate’ locations in that reference genome. The similarity measurement, called alignment, formulated as an approximate string matching problem, is the computational bottleneck because: (i) it is implemented using quadratic-time dynamic programming algorithms and (ii) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper’s execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment algorithms. Results: We propose GateKeeper, a new hardware accelerator that functions as a pre-alignment step that quickly filters out most incorrect candidate locations. GateKeeper is the first design to accelerate pre-alignment using Field-Programmable Gate Arrays (FPGAs), which can perform pre-alignment much faster than software. When implemented on a single FPGA chip, GateKeeper maintains high accuracy (on average >96%) while providing, on average, 90-fold and 130-fold speedup over the state-of-the-art software pre-alignment techniques, Adjacency Filter and Shifted Hamming Distance (SHD), respectively. The addition of GateKeeper as a pre-alignment step can reduce the verification time of the mrFAST mapper by a factor of 10.
@article{abc,
	abstract = {Motivation
High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments -called short reads- that cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and {\textquoteleft}candidate{\textquoteright} locations in that reference genome. The similarity measurement, called alignment, formulated as an approximate string matching problem, is the computational bottleneck because: (i) it is implemented using quadratic-time dynamic programming algorithms and (ii) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper{\textquoteright}s execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment algorithms.

Results
We propose GateKeeper, a new hardware accelerator that functions as a pre-alignment step that quickly filters out most incorrect candidate locations. GateKeeper is the first design to accelerate pre-alignment using Field-Programmable Gate Arrays (FPGAs), which can perform pre-alignment much faster than software. When implemented on a single FPGA chip, GateKeeper maintains high accuracy (on average >96\%) while providing, on average, 90-fold and 130-fold speedup over the state-of-the-art software pre-alignment techniques, Adjacency Filter and Shifted Hamming Distance (SHD), respectively. The addition of GateKeeper as a pre-alignment step can reduce the verification time of the mrFAST mapper by a factor of 10.},
	author = {Mohammed Alser and Hasan Hassan and Hongyi Xin and Oguz Ergin and Onur Mutlu and Can Alkan},
	pages = {3355-3363},
	journal = {Bioinformatics},
	title = {GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping},
	volume = {33},
	year = {2017}
}
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, MA, USA, October 2017
DRAM cells in close proximity can fail depending on the data content in neighboring cells. These failures are called data-dependent failures. Detecting and mitigating these failures online, while the system is running in the field, enables various optimizations that improve reliability, latency, and energy efficiency of the system. For example, a system can improve performance and energy efficiency by using a lower refresh rate for most cells and mitigate the failing cells using higher refresh rates or error correcting codes. All these system optimizations depend on accurately detecting every possible data-dependent failure that could occur with any content in DRAM. Unfortunately, detecting all data-dependent failures requires the knowledge of DRAM internals specific to each DRAM chip. As internal DRAM architecture is not exposed to the system, detecting data-dependent failures at the system-level is a major challenge. In this paper, we decouple the detection and mitigation of data-dependent failures from physical DRAM organization such that it is possible to detect failures without knowledge of DRAM internals. To this end, we propose MEMCON, a memory content-based detection and mitigation mechanism for data-dependent failures in DRAM. MEMCON does not detect every possible data-dependent failure. Instead, it detects and mitigates failures that occur only with the current content in memory while the programs are running in the system. Such a mechanism needs to detect failures whenever there is a write access that changes the content of memory. As detection of failure with a runtime testing has a high overhead, MEMCON selectively initiates a test on a write, only when the time between two consecutive writes to that page (i.e., write interval) is long enough to provide significant benefit by lowering the refresh rate during that interval. 
MEMCON builds upon a simple, practical mechanism that predicts the long write intervals based on our observation that the write intervals in real workloads follow a Pareto distribution: the longer a page remains idle after a write, the longer it is expected to remain idle. Our evaluation shows that compared to a system that uses an aggressive refresh rate, MEMCON reduces refresh operations by 65-74%, leading to a 10%/17%/40% (min) to 12%/22%/50% (max) performance improvement for a single-core and 10%/23%/52% (min) to 17%/29%/65% (max) performance improvement for a 4-core system using 8/16/32 Gb DRAM chips.
@inproceedings{abc,
	abstract = {DRAM cells in close proximity can fail depending on the data content in neighboring cells. These failures are called data-dependent failures. Detecting and mitigating these failures online, while the system is running in the field, enables various optimizations that improve reliability, latency, and energy efficiency of the system. For example, a system can improve performance and energy efficiency by using a lower refresh rate for most cells and mitigate the failing cells using higher refresh rates or error correcting codes. All these system optimizations depend on accurately detecting every possible data-dependent failure that could occur with any content in DRAM. Unfortunately, detecting all data-dependent failures requires the knowledge of DRAM internals specific to each DRAM chip. As internal DRAM architecture is not exposed to the system, detecting data-dependent failures at the system-level is a major challenge.

In this paper, we decouple the detection and mitigation of data-dependent failures from physical DRAM organization such that it is possible to detect failures without knowledge of DRAM internals. To this end, we propose MEMCON, a memory content-based detection and mitigation mechanism for data-dependent failures in DRAM. MEMCON does not detect every possible data-dependent failure. Instead, it detects and mitigates failures that occur only with the current content in memory while the programs are running in the system. Such a mechanism needs to detect failures whenever there is a write access that changes the content of memory. As detection of failure with a runtime testing has a high overhead, MEMCON selectively initiates a test on a write, only when the time between two consecutive writes to that page (i.e., write interval) is long enough to provide significant benefit by lowering the refresh rate during that interval. MEMCON builds upon a simple, practical mechanism that predicts the long write intervals based on our observation that the write intervals in real workloads follow a Pareto distribution: the longer a page remains idle after a write, the longer it is expected to remain idle.

Our evaluation shows that compared to a system that uses an aggressive refresh rate, MEMCON reduces refresh operations by 65--74\%, leading to a 10\%/17\%/40\% (min) to 12\%/22\%/50\% (max) performance improvement for a single-core and 10\%/23\%/52\% (min) to 17\%/29\%/65\% (max) performance improvement for a 4-core system using 8/16/32 Gb DRAM chips.},
	author = {Samira Manabi Khan and Chris Wilkerson and Zhe Wang and Alaa R. Alameldeen and Donghyuk Lee and Onur Mutlu},
	booktitle = {Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture},
	title = {Detecting and mitigating data-dependent DRAM failures by exploiting current memory content},
	venue = {Cambridge, MA, USA},
	year = {2017}
}
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, MA, USA, October 2017
Many important applications trigger bulk bitwise operations, i.e., bitwise operations on large bit vectors. In fact, recent works design techniques that exploit fast bulk bitwise operations to accelerate databases (bitmap indices, BitWeaving) and web search (BitFunnel). Unfortunately, in existing architectures, the throughput of bulk bitwise operations is limited by the memory bandwidth available to the processing unit (e.g., CPU, GPU, FPGA, processing-in-memory). To overcome this bottleneck, we propose Ambit, an Accelerator-in-Memory for bulk bitwise operations. Unlike prior works, Ambit exploits the analog operation of DRAM technology to perform bitwise operations completely inside DRAM, thereby exploiting the full internal DRAM bandwidth. Ambit consists of two components. First, simultaneous activation of three DRAM rows that share the same set of sense amplifiers enables the system to perform bitwise AND and OR operations. Second, with modest changes to the sense amplifier, the system can use the inverters present inside the sense amplifier to perform bitwise NOT operations. With these two components, Ambit can perform any bulk bitwise operation efficiently inside DRAM. Ambit largely exploits existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1% of DRAM chip area). Importantly, Ambit uses the modern DRAM interface without any changes, and therefore it can be directly plugged onto the memory bus. Our extensive circuit simulations show that Ambit works as expected even in the presence of significant process variation. Averaged across seven bulk bitwise operations, Ambit improves performance by 32X and reduces energy consumption by 35X compared to state-of-the-art systems. When integrated with Hybrid Memory Cube (HMC), a 3D-stacked DRAM with a logic layer, Ambit improves performance of bulk bitwise operations by 9.7X compared to processing in the logic layer of the HMC. 
Ambit improves the performance of three real-world data-intensive applications, 1) database bitmap indices, 2) BitWeaving, a technique to accelerate database scans, and 3) bit-vector-based implementation of sets, by 3X-7X compared to a state-of-the-art baseline using SIMD optimizations. We describe four other applications that can benefit from Ambit, including a recent technique proposed to speed up web search. We believe that large performance and energy improvements provided by Ambit can enable other applications to use bulk bitwise operations.
@inproceedings{abc,
	abstract = {Many important applications trigger bulk bitwise operations, i.e., bitwise operations on large bit vectors. In fact, recent works design techniques that exploit fast bulk bitwise operations to accelerate databases (bitmap indices, BitWeaving) and web search (BitFunnel). Unfortunately, in existing architectures, the throughput of bulk bitwise operations is limited by the memory bandwidth available to the processing unit (e.g., CPU, GPU, FPGA, processing-in-memory).

To overcome this bottleneck, we propose Ambit, an Accelerator-in-Memory for bulk bitwise operations. Unlike prior works, Ambit exploits the analog operation of DRAM technology to perform bitwise operations completely inside DRAM, thereby exploiting the full internal DRAM bandwidth. Ambit consists of two components. First, simultaneous activation of three DRAM rows that share the same set of sense amplifiers enables the system to perform bitwise AND and OR operations. Second, with modest changes to the sense amplifier, the system can use the inverters present inside the sense amplifier to perform bitwise NOT operations. With these two components, Ambit can perform any bulk bitwise operation efficiently inside DRAM. Ambit largely exploits existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1\% of DRAM chip area). Importantly, Ambit uses the modern DRAM interface without any changes, and therefore it can be directly plugged onto the memory bus.

Our extensive circuit simulations show that Ambit works as expected even in the presence of significant process variation. Averaged across seven bulk bitwise operations, Ambit improves performance by 32X and reduces energy consumption by 35X compared to state-of-the-art systems. When integrated with Hybrid Memory Cube (HMC), a 3D-stacked DRAM with a logic layer, Ambit improves performance of bulk bitwise operations by 9.7X compared to processing in the logic layer of the HMC. Ambit improves the performance of three real-world data-intensive applications, 1) database bitmap indices, 2) BitWeaving, a technique to accelerate database scans, and 3) bit-vector-based implementation of sets, by 3X-7X compared to a state-of-the-art baseline using SIMD optimizations. We describe four other applications that can benefit from Ambit, including a recent technique proposed to speed up web search. We believe that large performance and energy improvements provided by Ambit can enable other applications to use bulk bitwise operations.},
	author = {Vivek Seshadri and Donghyuk Lee and Thomas Mullins and Hasan Hassan and Amirali Boroumand and Jeremie Kim and Michael A. Kozuch and Onur Mutlu and Phillip B. Gibbons and Todd C. Mowry},
	booktitle = {Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture},
	title = {Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology},
	venue = {Cambridge, MA, USA},
	year = {2017}
}
Proceedings of the 2017 IEEE International Conference on Cluster Computing (CLUSTER), Honolulu, HI, USA, September 2017
While the memory footprints of cloud and HPC applications continue to increase, fundamental issues with DRAM scaling are likely to prevent traditional main memory systems, composed of monolithic DRAM, from greatly growing in capacity. Hybrid memory systems can mitigate the scaling limitations of monolithic DRAM by pairing together multiple memory technologies (e.g., different types of DRAM, or DRAM and non-volatile memory) at the same level of the memory hierarchy. The goal of a hybrid main memory is to combine the different advantages of the multiple memory types in a cost-effective manner while avoiding the disadvantages of each technology. Memory pages are placed in and migrated between the different memories within a hybrid memory system, based on the properties of each page. It is important to make intelligent page management (i.e., placement and migration) decisions, as they can significantly affect system performance. In this paper, we propose utility-based hybrid memory management (UH-MEM), a new page management mechanism for various hybrid memories, that systematically estimates the utility (i.e., the system performance benefit) of migrating a page between different memory types, and uses this information to guide data placement. UH-MEM operates in two steps. First, it estimates how much a single application would benefit from migrating one of its pages to a different type of memory, by comprehensively considering access frequency, row buffer locality, and memory-level parallelism. Second, it translates the estimated benefit of a single application to an estimate of the overall system performance benefit from such a migration. We evaluate the effectiveness of UH-MEM with various types of hybrid memories, and show that it significantly improves system performance on each of these hybrid memories. 
For a memory system with DRAM and non-volatile memory, UH-MEM improves performance by 14% on average (and up to 26%) compared to the best of three evaluated state-of-the-art mechanisms across a large number of data-intensive workloads.
@inproceedings{abc,
	abstract = {While the memory footprints of cloud and HPC applications continue to increase, fundamental issues with DRAM scaling are likely to prevent traditional main memory systems, composed of monolithic DRAM, from greatly growing in capacity. Hybrid memory systems can mitigate the scaling limitations of monolithic DRAM by pairing together multiple memory technologies (e.g., different types of DRAM, or DRAM and non-volatile memory) at the same level of the memory hierarchy. The goal of a hybrid main memory is to combine the different advantages of the multiple memory types in a cost-effective manner while avoiding the disadvantages of each technology. Memory pages are placed in and migrated between the different memories within a hybrid memory system, based on the properties of each page. It is important to make intelligent page management (i.e., placement and migration) decisions, as they can significantly affect system performance. In this paper, we propose utility-based hybrid memory management (UH-MEM), a new page management mechanism for various hybrid memories, that systematically estimates the utility (i.e., the system performance benefit) of migrating a page between different memory types, and uses this information to guide data placement. UH-MEM operates in two steps. First, it estimates how much a single application would benefit from migrating one of its pages to a different type of memory, by comprehensively considering access frequency, row buffer locality, and memory-level parallelism. Second, it translates the estimated benefit of a single application to an estimate of the overall system performance benefit from such a migration. We evaluate the effectiveness of UH-MEM with various types of hybrid memories, and show that it significantly improves system performance on each of these hybrid memories. 
For a memory system with DRAM and non-volatile memory, UH-MEM improves performance by 14\% on average (and up to 26\%) compared to the best of three evaluated state-of-the-art mechanisms across a large number of data-intensive workloads.},
	author = {Yang Li and Saugata Ghose and Jongmoo Choi and Jin Sun and Hui Wang and Onur Mutlu},
	booktitle = {Proceedings of the 2017 IEEE International Conference on Cluster Computing (CLUSTER)},
	title = {Utility-Based Hybrid Memory Management},
	venue = {Honolulu, HI, USA},
	year = {2017}
}
Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2017, Washington DC, USA, July 2017
@inproceedings{abc,
	author = {Zhiyu Liu and Irina Calciu and Maurice Herlihy and Onur Mutlu},
	booktitle = {Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2017, Washington DC, USA},
	title = {Concurrent Data Structures for Near-Memory Computing},
	url = {http://doi.acm.org/10.1145/3087556.3087582},
	year = {2017}
}
Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, Urbana-Champaign, IL, USA, June 2017
@inproceedings{abc,
	author = {Kevin K. Chang and Abdullah Giray Yaglik{\c c}i and Saugata Ghose and Aditya Agrawal and Niladrish Chatterjee and Abhijith Kashyap and Donghyuk Lee and Mike O{\textquoteright}Connor and Hasan Hassan and Onur Mutlu},
	booktitle = {Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, Urbana-Champaign, IL, USA},
	title = {Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms},
	url = {http://doi.acm.org/10.1145/3078505.3078590},
	year = {2017}
}
Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, Urbana-Champaign, IL, USA, June 2017
@inproceedings{abc,
	author = {Donghyuk Lee and Samira Manabi Khan and Lavanya Subramanian and Saugata Ghose and Rachata Ausavarungnirun and Gennady Pekhimenko and Vivek Seshadri and Onur Mutlu},
	booktitle = {Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, Urbana-Champaign, IL, USA},
	title = {Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms},
	url = {http://doi.acm.org/10.1145/3078505.3078533},
	year = {2017}
}
Proceedings of the International Conference on Supercomputing, ICS 2017, Chicago, IL, USA, June 2017
@inproceedings{abc,
	author = {Xi-Yue Xiang and Wentao Shi and Saugata Ghose and Lu Peng and Onur Mutlu and Nian-Feng Tzeng},
	booktitle = {Proceedings of the International Conference on Supercomputing, ICS 2017, Chicago, IL, USA},
	title = {Carpool: a bufferless on-chip network supporting adaptive multicast and hotspot alleviation},
	url = {http://doi.acm.org/10.1145/3079079.3079090},
	year = {2017}
}
Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, Toronto, ON, Canada, June 2017
@inproceedings{abc,
	author = {Minesh Patel and Jeremie Kim and Onur Mutlu},
	booktitle = {Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, Toronto, ON, Canada},
	title = {The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions},
	url = {http://doi.acm.org/10.1145/3079856.3080242},
	year = {2017}
}
25th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2017, Napa, CA, USA, April 2017
Stochastic gradient descent (SGD) is a commonly used algorithm for training linear machine learning models. Based on vector algebra, it benefits from the inherent parallelism available in an FPGA. In this paper, we first present a single-precision floating-point SGD implementation on an FPGA that provides performance similar to that of a 10-core CPU. We then adapt the design to make it capable of processing low-precision data. The low-precision data is obtained from a novel compression scheme, called stochastic quantization, specifically designed for machine learning applications. We test both full-precision and low-precision designs on various regression and classification data sets. We achieve up to an order of magnitude training speedup when using low-precision data compared to a full-precision SGD on the same FPGA and a state-of-the-art multi-core solution, while maintaining the quality of training. We open source the designs presented in this paper.
@inproceedings{abc,
	abstract = {Stochastic gradient descent (SGD) is a commonly used algorithm for training linear machine learning models. Based on vector algebra, it benefits from the inherent parallelism available in an FPGA. In this paper, we first present a single-precision floating-point SGD implementation on an FPGA that provides performance similar to that of a 10-core CPU. We then adapt the design to make it capable of processing low-precision data. The low-precision data is obtained from a novel compression scheme, called stochastic quantization, specifically designed for machine learning applications. We test both full-precision and low-precision designs on various regression and classification data sets. We achieve up to an order of magnitude training speedup when using low-precision data compared to a full-precision SGD on the same FPGA and a state-of-the-art multi-core solution, while maintaining the quality of training. We open source the designs presented in this paper.},
	author = {Kaan Kara and Dan Alistarh and Gustavo Alonso and Onur Mutlu and Ce Zhang},
	booktitle = {25th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2017, Napa, CA, USA},
	title = {FPGA-Accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-Off},
	url = {https://doi.org/10.1109/FCCM.2017.39},
	venue = {Napa, CA, USA},
	year = {2017}
}
Proceedings of the 2017 Digital Forensics Conference, Überlingen, Germany, March 2017
Digital forensic investigators often need to extract data from a seized device that contains NAND flash memory. Many such devices are physically damaged, preventing investigators from using automated techniques to extract the data stored within the device. Instead, investigators turn to chip-off analysis, where they use a thermal-based procedure to physically remove the NAND flash memory chip from the device, and access the chip directly to extract the raw data stored on the chip. We perform an analysis of the errors introduced into multi-level cell (MLC) NAND flash memory chips after the device has been seized. We make two major observations. First, between the time that a device is seized and the time digital forensic investigators perform data extraction, a large number of errors can be introduced as a result of charge leakage from the cells of the NAND flash memory (known as data retention errors). Second, when thermal-based chip removal is performed, the number of errors in the data stored within NAND flash memory can increase by two or more orders of magnitude, as the high temperature applied to the chip greatly accelerates charge leakage. We demonstrate that the chip-off analysis based forensic data recovery procedure is quite destructive, and can often render most of the data within NAND flash memory uncorrectable, and, thus, unrecoverable. To mitigate the errors introduced during the forensic recovery process, we explore a new hardware-based approach. We exploit a fine-grained read reference voltage control mechanism implemented in modern NAND flash memory chips, called read-retry, which can compensate for the charge leakage that occurs due to (1) retention loss and (2) thermal-based chip removal. The read-retry mechanism successfully reduces the number of errors, such that the original data can be fully recovered in our tested chips as long as the chips were not heavily used prior to seizure. 
We conclude that the read-retry mechanism should be adopted as part of the forensic data recovery process.
@inproceedings{abc,
	abstract = {Digital forensic investigators often need to extract data from a seized device that contains NAND flash memory. Many such devices are physically damaged, preventing investigators from using automated techniques to extract the data stored within the device. Instead, investigators turn to chip-off analysis, where they use a thermal-based procedure to physically remove the NAND flash memory chip from the device, and access the chip directly to extract the raw data stored on the chip.

We perform an analysis of the errors introduced into multi-level cell (MLC) NAND flash memory chips after the device has been seized. We make two major observations. First, between the time that a device is seized and the time digital forensic investigators perform data extraction, a large number of errors can be introduced as a result of charge leakage from the cells of the NAND flash memory (known as data retention errors). Second, when thermal-based chip removal is performed, the number of errors in the data stored within NAND flash memory can increase by two or more orders of magnitude, as the high temperature applied to the chip greatly accelerates charge leakage. We demonstrate that the chip-off analysis based forensic data recovery procedure is quite destructive, and can often render most of the data within NAND flash memory uncorrectable, and, thus, unrecoverable.

To mitigate the errors introduced during the forensic recovery process, we explore a new hardware-based approach. We exploit a fine-grained read reference voltage control mechanism implemented in modern NAND flash memory chips, called read-retry, which can compensate for the charge leakage that occurs due to (1) retention loss and (2) thermal-based chip removal. The read-retry mechanism successfully reduces the number of errors, such that the original data can be fully recovered in our tested chips as long as the chips were not heavily used prior to seizure. We conclude that the read-retry mechanism should be adopted as part of the forensic data recovery process.},
	author = {Aya Fukami and Saugata Ghose and Yixin Luo and Yu Cai and Onur Mutlu},
	booktitle = {Proceedings of the 2017 Digital Forensics Conference},
	title = {Improving the Reliability of Chip-Off Forensic Analysis of NAND Flash Memory Devices},
	venue = {{\"U}berlingen, Germany},
	year = {2017}
}
14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017, Boston, MA, USA, March 2017
@inproceedings{abc,
	author = {Kevin Hsieh and Aaron Harlap and Nandita Vijaykumar and Dimitris Konomis and Gregory R. Ganger and Phillip B. Gibbons and Onur Mutlu},
	booktitle = {14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017, Boston, MA, USA},
	title = {Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds.},
	url = {https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/hsieh},
	year = {2017}
}
Design, Automation & Test in Europe Conference & Exhibition, DATE 2017, Lausanne, Switzerland, March 2017
@inproceedings{abc,
	author = {Onur Mutlu},
	booktitle = {Design, Automation \& Test in Europe Conference \& Exhibition, DATE 2017, Lausanne, Switzerland},
	title = {The RowHammer problem and other issues we may face as memory becomes denser.},
	url = {https://doi.org/10.23919/DATE.2017.7927156},
	year = {2017}
}
2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA, February 2017
@inproceedings{abc,
	author = {Hasan Hassan and Nandita Vijaykumar and Samira Manabi Khan and Saugata Ghose and Kevin K. Chang and Gennady Pekhimenko and Donghyuk Lee and Oguz Ergin and Onur Mutlu},
	booktitle = {2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA},
	title = {SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies.},
	url = {https://doi.org/10.1109/HPCA.2017.62},
	year = {2017}
}
2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA, February 2017
@inproceedings{abc,
	author = {Yu Cai and Saugata Ghose and Yixin Luo and Ken Mai and Onur Mutlu and Erich F. Haratsch},
	booktitle = {2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA},
	title = {Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques.},
	url = {https://doi.org/10.1109/HPCA.2017.61},
	year = {2017}
}
POMACS, January 2017
@article{abc,
	author = {Kevin K. Chang and A. Giray Ya{\u g}l{\i}k{\c c}{\i} and Saugata Ghose and Aditya Agrawal and Niladrish Chatterjee and Abhijith Kashyap and Donghyuk Lee and Mike O{\textquoteright}Connor and Hasan Hassan and Onur Mutlu},
	journal = {POMACS},
	title = {Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms.},
	url = {http://doi.acm.org/10.1145/3084447},
	year = {2017}
}
POMACS, January 2017
@article{abc,
	author = {Donghyuk Lee and Samira Manabi Khan and Lavanya Subramanian and Saugata Ghose and Rachata Ausavarungnirun and Gennady Pekhimenko and Vivek Seshadri and Onur Mutlu},
	journal = {POMACS},
	title = {Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms.},
	url = {http://doi.acm.org/10.1145/3084464},
	year = {2017}
}
CoRR, January 2017
@article{abc,
	author = {Onur Mutlu},
	journal = {CoRR},
	title = {The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser.},
	url = {http://arxiv.org/abs/1703.00626},
	year = {2017}
}
CoRR, January 2017
@article{abc,
	author = {Xiangyao Yu and Christopher J. Hughes and Nadathur Satish and Onur Mutlu and Srinivas Devadas},
	journal = {CoRR},
	title = {Banshee: Bandwidth-Efficient DRAM Caching Via Software/Hardware Cooperation.},
	url = {http://arxiv.org/abs/1704.02677},
	year = {2017}
}
Computer Architecture Letters, January 2017
@article{abc,
	author = {Amirali Boroumand and Saugata Ghose and Minesh Patel and Hasan Hassan and Brandon Lucia and Kevin Hsieh and Krishna T. Malladi and Hongzhong Zheng and Onur Mutlu},
	journal = {Computer Architecture Letters},
	title = {LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory.},
	url = {https://doi.org/10.1109/LCA.2016.2577557},
	year = {2017}
}
CoRR, January 2017
@article{abc,
	author = {Yu Cai and Saugata Ghose and Erich F. Haratsch and Yixin Luo and Onur Mutlu},
	journal = {CoRR},
	title = {Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid-State Drives.},
	url = {http://arxiv.org/abs/1706.08642},
	year = {2017}
}
CoRR, January 2017
@article{abc,
	author = {Yixin Luo and Saugata Ghose and Tianshi Li and Sriram Govindan and Bikash Sharma and Bryan Kelly and Amirali Boroumand and Onur Mutlu},
	journal = {CoRR},
	title = {Using ECC DRAM to Adaptively Increase Memory Capacity.},
	url = {http://arxiv.org/abs/1706.08870},
	year = {2017}
}
CoRR, January 2017
@article{abc,
	author = {Youyou Lu and Jiwu Shu and Long Sun and Onur Mutlu},
	journal = {CoRR},
	title = {Improving the Performance and Endurance of Persistent Memory with Loose-Ordering Consistency.},
	url = {http://arxiv.org/abs/1705.03623},
	year = {2017}
}
CoRR, January 2017
@article{abc,
	author = {Amirali Boroumand and Saugata Ghose and Minesh Patel and Hasan Hassan and Brandon Lucia and Nastaran Hajinazar and Kevin Hsieh and Krishna T. Malladi and Hongzhong Zheng and Onur Mutlu},
	journal = {CoRR},
	title = {LazyPIM: Efficient Support for Cache Coherence in Processing-in-Memory Architectures.},
	url = {http://arxiv.org/abs/1706.03162},
	year = {2017}
}
CoRR, January 2017
@article{abc,
	author = {Kevin K. Chang and Abdullah Giray Ya{\u g}l{\i}k{\c c}{\i} and Saugata Ghose and Aditya Agrawal and Niladrish Chatterjee and Abhijith Kashyap and Donghyuk Lee and Mike O{\textquoteright}Connor and Hasan Hassan and Onur Mutlu},
	journal = {CoRR},
	title = {Understanding Reduced-Voltage Operation in Modern DRAM Chips: Characterization, Analysis, and Mechanisms.},
	url = {http://arxiv.org/abs/1705.10292},
	year = {2017}
}
Advances in Computers, January 2017
@article{abc,
	author = {Vivek Seshadri and Onur Mutlu},
	journal = {Advances in Computers},
	title = {Chapter Four - Simple Operations in Memory to Reduce Data Movement.},
	url = {https://doi.org/10.1016/bs.adcom.2017.04.004},
	year = {2017}
}

2016

4th Workshop on Interactions of NVM/Flash with Operating Systems and Workloads, INFLOW@OSDI 2016, Savannah, GA, USA, November 2016
@inproceedings{abc,
	author = {Himanshu Chauhan and Irina Calciu and Vijay Chidambaram and Eric Schkufza and Onur Mutlu and Pratap Subrahmanyam},
	booktitle = {4th Workshop on Interactions of NVM/Flash with Operating Systems and Workloads, INFLOW@OSDI 2016, Savannah, GA, USA},
	title = {NVMOVE: Helping Programmers Move to Byte-Based Persistence.},
	url = {https://www.usenix.org/conference/inflow16/workshop-program/presentation/chauhan},
	year = {2016}
}
12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2016
@inproceedings{abc,
	author = {Khanh Nguyen and Lu Fang and Guoqing (Harry) Xu and Brian Demsky and Shan Lu and Sanazsadat Alamian and Onur Mutlu},
	booktitle = {12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA},
	title = {Yak: A High-Performance Big-Data-Friendly Garbage Collector.},
	url = {https://www.usenix.org/conference/osdi16/technical-sessions/presentation/nguyen},
	year = {2016}
}
2016 International Symposium on Rapid System Prototyping, RSP 2016, Pittsburgh, PA, USA, October 2016
@inproceedings{abc,
	author = {Onur Mutlu},
	booktitle = {2016 International Symposium on Rapid System Prototyping, RSP 2016, Pittsburgh, PA, USA},
	title = {Keynote: rethinking memory system design.},
	url = {https://doi.org/10.1145/2990299.2990300},
	year = {2016}
}
34th IEEE International Conference on Computer Design, ICCD 2016, Scottsdale, AZ, USA, October 2016
@inproceedings{abc,
	author = {Kevin Hsieh and Samira Manabi Khan and Nandita Vijaykumar and Kevin K. Chang and Amirali Boroumand and Saugata Ghose and Onur Mutlu},
	booktitle = {34th IEEE International Conference on Computer Design, ICCD 2016, Scottsdale, AZ, USA},
	title = {Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation.},
	url = {http://dx.doi.org/10.1109/ICCD.2016.7753257},
	year = {2016}
}
34th IEEE International Conference on Computer Design, ICCD 2016, Scottsdale, AZ, USA, October 2016
@inproceedings{abc,
	author = {Xi-Yue Xiang and Saugata Ghose and Onur Mutlu and Nian-Feng Tzeng},
	booktitle = {34th IEEE International Conference on Computer Design, ICCD 2016, Scottsdale, AZ, USA},
	title = {A model for Application Slowdown Estimation in on-chip networks and its use for improving system fairness and performance.},
	url = {http://dx.doi.org/10.1109/ICCD.2016.7753327},
	year = {2016}
}
49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan, October 2016
@inproceedings{abc,
	author = {Nandita Vijaykumar and Kevin Hsieh and Gennady Pekhimenko and Samira Manabi Khan and Ashish Shrestha and Saugata Ghose and Adwait Jog and Phillip B. Gibbons and Onur Mutlu},
	booktitle = {49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan},
	title = {Zorua: A holistic approach to resource virtualization in GPUs.},
	url = {http://dx.doi.org/10.1109/MICRO.2016.7783718},
	year = {2016}
}
49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan, October 2016
@inproceedings{abc,
	author = {Milad Hashemi and Onur Mutlu and Yale N. Patt},
	booktitle = {49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan},
	title = {Continuous runahead: Transparent hardware acceleration for memory intensive workloads.},
	url = {http://dx.doi.org/10.1109/MICRO.2016.7783764},
	year = {2016}
}
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT 2016, Haifa, Israel, September 2016
@inproceedings{abc,
	author = {Ashutosh Pattnaik and Xulong Tang and Adwait Jog and Onur Kayiran and Asit K. Mishra and Mahmut T. Kandemir and Onur Mutlu and Chita R. Das},
	booktitle = {Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT 2016, Haifa, Israel},
	title = {Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities.},
	url = {http://doi.acm.org/10.1145/2967938.2967940},
	year = {2016}
}
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT 2016, Haifa, Israel, September 2016
@inproceedings{abc,
	author = {Onur Kayiran and Adwait Jog and Ashutosh Pattnaik and Rachata Ausavarungnirun and Xulong Tang and Mahmut T. Kandemir and Gabriel H. Loh and Onur Mutlu and Chita R. Das},
	booktitle = {Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT 2016, Haifa, Israel},
	title = {{$\mu$}C-States: Fine-grained GPU Datapath Power Management.},
	url = {http://doi.acm.org/10.1145/2967938.2967941},
	year = {2016}
}
46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2016, Toulouse, France, June 2016
@inproceedings{abc,
	author = {Samira Manabi Khan and Donghyuk Lee and Onur Mutlu},
	booktitle = {46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2016, Toulouse, France},
	title = {PARBOR: An Efficient System-Level Technique to Detect Data-Dependent Failures in DRAM.},
	url = {http://dx.doi.org/10.1109/DSN.2016.30},
	year = {2016}
}
Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science, Antibes Juan-Les-Pins, France, June 2016
@inproceedings{abc,
	author = {Kevin K. Chang and Abhijith Kashyap and Hasan Hassan and Saugata Ghose and Kevin Hsieh and Donghyuk Lee and Tianshi Li and Gennady Pekhimenko and Samira Manabi Khan and Onur Mutlu},
	booktitle = {Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science, Antibes Juan-Les-Pins, France},
	title = {Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization.},
	url = {http://doi.acm.org/10.1145/2896377.2901453},
	year = {2016}
}
Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science, Antibes Juan-Les-Pins, France, June 2016
@inproceedings{abc,
	author = {Adwait Jog and Onur Kayiran and Ashutosh Pattnaik and Mahmut T. Kandemir and Onur Mutlu and Ravishankar Iyer and Chita R. Das},
	booktitle = {Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science, Antibes Juan-Les-Pins, France},
	title = {Exploiting Core Criticality for Enhanced GPU Performance.},
	url = {http://doi.acm.org/10.1145/2896377.2901468},
	year = {2016}
}
43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 2016
@inproceedings{abc,
	author = {Kevin Hsieh and Eiman Ebrahimi and Gwangsun Kim and Niladrish Chatterjee and Mike O{\textquoteright}Connor and Nandita Vijaykumar and Onur Mutlu and Stephen W. Keckler},
	booktitle = {43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea},
	title = {Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems.},
	url = {http://doi.ieeecomputersociety.org/10.1109/ISCA.2016.27},
	year = {2016}
}
43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 2016
@inproceedings{abc,
	author = {Milad Hashemi and Khubaib and Eiman Ebrahimi and Onur Mutlu and Yale N. Patt},
	booktitle = {43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea},
	title = {Accelerating Dependent Cache Misses with an Enhanced Memory Controller.},
	url = {http://dx.doi.org/10.1109/ISCA.2016.46},
	year = {2016}
}
Proceedings of the 53rd Annual Design Automation Conference, DAC 2016, Austin, TX, USA, June 2016
@inproceedings{abc,
	author = {Wayne P. Burleson and Onur Mutlu and Mohit Tiwari},
	booktitle = {Proceedings of the 53rd Annual Design Automation Conference, DAC 2016, Austin, TX, USA},
	title = {Invited - Who is the major threat to tomorrow{\textquoteright}s security?: you, the hardware designer.},
	url = {http://doi.acm.org/10.1145/2897937.2905022},
	year = {2016}
}
2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 2016
@inproceedings{abc,
	author = {Gennady Pekhimenko and Evgeny Bolotin and Nandita Vijaykumar and Onur Mutlu and Todd C. Mowry and Stephen W. Keckler},
	booktitle = {2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain},
	title = {A case for toggle-aware compression for GPU systems.},
	url = {http://dx.doi.org/10.1109/HPCA.2016.7446064},
	year = {2016}
}
2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 2016
@inproceedings{abc,
	author = {Hasan Hassan and Gennady Pekhimenko and Nandita Vijaykumar and Vivek Seshadri and Donghyuk Lee and Oguz Ergin and Onur Mutlu},
	booktitle = {2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain},
	title = {ChargeCache: Reducing DRAM latency by exploiting row access locality.},
	url = {http://dx.doi.org/10.1109/HPCA.2016.7446096},
	year = {2016}
}
2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 2016
@inproceedings{abc,
	author = {Yang Li and Di Wang and Saugata Ghose and Jie Liu and Sriram Govindan and Sean James and Eric Peterson and John Siegler and Rachata Ausavarungnirun and Onur Mutlu},
	booktitle = {2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain},
	title = {SizeCap: Efficiently handling power surges in fuel cell powered data centers.},
	url = {http://dx.doi.org/10.1109/HPCA.2016.7446085},
	year = {2016}
}
2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 2016
@inproceedings{abc,
	author = {Kevin K. Chang and Prashant J. Nair and Donghyuk Lee and Saugata Ghose and Moinuddin K. Qureshi and Onur Mutlu},
	booktitle = {2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain},
	title = {Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM.},
	url = {http://dx.doi.org/10.1109/HPCA.2016.7446095},
	year = {2016}
}
IEEE Micro, January 2016
@article{abc,
	author = {Onur Mutlu and Richard A. Belgard and Nick Tredennick and Mike Schlansker},
	journal = {IEEE Micro},
	title = {The 2014 MICRO Test of Time Award Winners: From 1978 to 1992.},
	url = {http://dx.doi.org/10.1109/MM.2016.7},
	year = {2016}
}
Bioinformatics, January 2016
@article{abc,
	author = {Hongyi Xin and Sunny Nahar and Richard Zhu and John Emmons and Gennady Pekhimenko and Carl Kingsford and Can Alkan and Onur Mutlu},
	journal = {Bioinformatics},
	title = {Optimal seed solver: optimizing seed selection in read mapping.},
	url = {http://dx.doi.org/10.1093/bioinformatics/btv670},
	year = {2016}
}
TACO, January 2016
@article{abc,
	author = {Amir Yazdanbakhsh and Gennady Pekhimenko and Bradley Thwaites and Hadi Esmaeilzadeh and Onur Mutlu and Todd C. Mowry},
	journal = {TACO},
	title = {RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads.},
	url = {http://doi.acm.org/10.1145/2836168},
	year = {2016}
}
IEEE Journal on Selected Areas in Communications, January 2016
@article{abc,
	author = {Yixin Luo and Saugata Ghose and Yu Cai and Erich F. Haratsch and Onur Mutlu},
	journal = {IEEE Journal on Selected Areas in Communications},
	title = {Enabling Accurate and Practical Online Flash Channel Modeling for Modern MLC NAND Flash Memory.},
	url = {http://dx.doi.org/10.1109/JSAC.2016.2603608},
	year = {2016}
}
Real-Time Systems, January 2016
@article{abc,
	author = {Hyoseung Kim and Dionisio de Niz and Bj{\"o}rn Andersson and Mark H. Klein and Onur Mutlu and Ragunathan Rajkumar},
	journal = {Real-Time Systems},
	title = {Bounding and reducing memory interference in COTS-based multi-core systems.},
	url = {http://dx.doi.org/10.1007/s11241-016-9248-1},
	year = {2016}
}
TACO, January 2016
@article{abc,
	author = {Donghyuk Lee and Saugata Ghose and Gennady Pekhimenko and Samira Manabi Khan and Onur Mutlu},
	journal = {TACO},
	title = {Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost.},
	url = {http://doi.acm.org/10.1145/2832911},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Vivek Seshadri and Onur Mutlu},
	journal = {CoRR},
	title = {The Processing Using Memory Paradigm: In-DRAM Bulk Copy, Initialization, Bitwise AND and OR.},
	url = {http://arxiv.org/abs/1610.09603},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Donghyuk Lee and Yoongu Kim and Gennady Pekhimenko and Samira Manabi Khan and Vivek Seshadri and Kevin Kai-Wei Chang and Onur Mutlu},
	journal = {CoRR},
	title = {Adaptive-Latency DRAM (AL-DRAM).},
	url = {http://arxiv.org/abs/1603.08454},
	year = {2016}
}
TACO, January 2016
@article{abc,
	author = {Hiroyuki Usui and Lavanya Subramanian and Kevin Kai-Wei Chang and Onur Mutlu},
	journal = {TACO},
	title = {DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators.},
	url = {http://doi.acm.org/10.1145/2847255},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Donghyuk Lee and Samira Manabi Khan and Lavanya Subramanian and Rachata Ausavarungnirun and Gennady Pekhimenko and Vivek Seshadri and Saugata Ghose and Onur Mutlu},
	journal = {CoRR},
	title = {Reducing DRAM Latency by Exploiting Design-Induced Latency Variation in Modern DRAM Chips.},
	url = {http://arxiv.org/abs/1610.09604},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Yoongu Kim and Ross Daly and Jeremie Kim and Chris Fallin and Ji-Hye Lee and Donghyuk Lee and Chris Wilkerson and Konrad Lai and Onur Mutlu},
	journal = {CoRR},
	title = {RowHammer: Reliability Analysis and Security Implications.},
	url = {http://arxiv.org/abs/1603.00747},
	year = {2016}
}
Computer Architecture Letters, January 2016
@article{abc,
	author = {Yoongu Kim and Weikun Yang and Onur Mutlu},
	journal = {Computer Architecture Letters},
	title = {Ramulator: A Fast and Extensible DRAM Simulator.},
	url = {http://dx.doi.org/10.1109/LCA.2015.2414456},
	year = {2016}
}
IEEE Design & Test, January 2016
@article{abc,
	author = {Amir Yazdanbakhsh and Bradley Thwaites and Hadi Esmaeilzadeh and Gennady Pekhimenko and Onur Mutlu and Todd C. Mowry},
	journal = {IEEE Design \& Test},
	title = {Mitigating the Memory Bottleneck With Approximate Load Value Prediction.},
	url = {http://dx.doi.org/10.1109/MDAT.2015.2504899},
	year = {2016}
}
IEEE Micro, January 2016
@article{abc,
	author = {Onur Mutlu and Richard A. Belgard and Thomas R. Gross and Norman P. Jouppi and John L. Hennessy and Steven A. Przybylski and Chris Rowen and Yale N. Patt and Wen-Mei W. Hwu and Stephen W. Melvin and Michael Shebanow and Tse-Yu Yeh and Andy Wolfe},
	journal = {IEEE Micro},
	title = {Common Bonds: MIPS, HPS, Two-Level Branch Prediction, and Compressed Code RISC Processor.},
	url = {http://dx.doi.org/10.1109/MM.2016.66},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Kevin Kai-Wei Chang and Donghyuk Lee and Zeshan Chishti and Alaa R. Alameldeen and Chris Wilkerson and Yoongu Kim and Onur Mutlu},
	journal = {CoRR},
	title = {Reducing Performance Impact of DRAM Refresh by Parallelizing Refreshes with Accesses.},
	url = {http://arxiv.org/abs/1601.06352},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Donghyuk Lee and Yoongu Kim and Vivek Seshadri and Jamie Liu and Lavanya Subramanian and Onur Mutlu},
	journal = {CoRR},
	title = {Tiered-Latency DRAM (TL-DRAM).},
	url = {http://arxiv.org/abs/1601.06903},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Yixin Luo and Sriram Govindan and Bikash Sharma and Mark Santaniello and Justin Meza and Aman Kansal and Jie Liu and Badriddine M. Khessib and Kushagra Vaid and Onur Mutlu},
	journal = {CoRR},
	title = {Heterogeneous-Reliability Memory: Exploiting Application-Level Memory Error Tolerance.},
	url = {http://arxiv.org/abs/1602.00729},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Rachata Ausavarungnirun and Chris Fallin and Xiangyao Yu and Kevin Kai-Wei Chang and Greg Nazario and Reetuparna Das and Gabriel H. Loh and Onur Mutlu},
	journal = {CoRR},
	title = {Achieving both High Energy Efficiency and High Performance in On-Chip Communication using Hierarchical Rings with Deflection Routing.},
	url = {http://arxiv.org/abs/1602.06005},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Vivek Seshadri and Donghyuk Lee and Thomas Mullins and Hasan Hassan and Amirali Boroumand and Jeremie Kim and Michael A. Kozuch and Onur Mutlu and Phillip B. Gibbons and Todd C. Mowry},
	journal = {CoRR},
	title = {Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM.},
	url = {http://arxiv.org/abs/1611.09988},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Mohammed Alser and Hasan Hassan and Hongyi Xin and Oguz Ergin and Onur Mutlu and Can Alkan},
	journal = {CoRR},
	title = {GateKeeper: Enabling Fast Pre-Alignment in DNA Short Read Mapping with a New Streaming Accelerator Architecture.},
	url = {http://arxiv.org/abs/1604.01789},
	year = {2016}
}
CoRR, January 2016
Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. This work describes the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate "assist warps" that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.
@article{abc,
	abstract = {Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. 
This work describes the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate "assist warps" that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. 
CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. 
We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7\% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.},
	author = {Nandita Vijaykumar and Gennady Pekhimenko and Adwait Jog and Saugata Ghose and Abhishek Bhowmick and Rachata Ausavarungnirun and Chita R. Das and Mahmut T. Kandemir and Todd C. Mowry and Onur Mutlu},
	journal = {CoRR},
	title = {A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps.},
	url = {http://arxiv.org/abs/1602.01348},
	year = {2016}
}
Parallel Computing, January 2016
@article{abc,
	author = {Rachata Ausavarungnirun and Chris Fallin and Xiangyao Yu and Kevin Kai-Wei Chang and Greg Nazario and Reetuparna Das and Gabriel H. Loh and Onur Mutlu},
	journal = {Parallel Computing},
	title = {A case for hierarchical rings with deflection routing: An energy-efficient on-chip communication substrate.},
	url = {http://dx.doi.org/10.1016/j.parco.2016.01.009},
	year = {2016}
}
IEEE Trans. Parallel Distrib. Syst., January 2016
@article{abc,
	author = {Lavanya Subramanian and Donghyuk Lee and Vivek Seshadri and Harsha Rastogi and Onur Mutlu},
	journal = {IEEE Trans. Parallel Distrib. Syst.},
	title = {BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling.},
	url = {http://dx.doi.org/10.1109/TPDS.2016.2526003},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Kevin Kai-Wei Chang and Gabriel H. Loh and Mithuna Thottethodi and Yasuko Eckert and Mike O{\textquoteright}Connor and Srilatha Manne and Lisa Hsu and Lavanya Subramanian and Onur Mutlu},
	journal = {CoRR},
	title = {Enabling Efficient Dynamic Resizing of Large DRAM Caches via A Hardware Consistent Hashing Mechanism.},
	url = {http://arxiv.org/abs/1602.00722},
	year = {2016}
}

2015

Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA, December 2015
@inproceedings{abc,
	author = {Lavanya Subramanian and Vivek Seshadri and Arnab Ghosh and Samira Manabi Khan and Onur Mutlu},
	booktitle = {Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA},
	title = {The application slowdown model: quantifying and controlling the impact of inter-application interference at shared caches and main memory.},
	url = {http://doi.acm.org/10.1145/2830772.2830803},
	year = {2015}
}
Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA, December 2015
@inproceedings{abc,
	author = {Vivek Seshadri and Thomas Mullins and Amirali Boroumand and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA},
	title = {Gather-scatter DRAM: in-DRAM address translation to improve the spatial locality of non-unit strided accesses.},
	url = {http://doi.acm.org/10.1145/2830772.2830820},
	year = {2015}
}
Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA, December 2015
@inproceedings{abc,
	author = {Jinglei Ren and Jishen Zhao and Samira Manabi Khan and Jongmoo Choi and Yongwei Wu and Onur Mutlu},
	booktitle = {Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA},
	title = {ThyNVM: enabling software-transparent crash consistency in persistent memory systems.},
	url = {http://doi.acm.org/10.1145/2830772.2830802},
	year = {2015}
}
Proceedings of the 8th International Workshop on Network on Chip Architectures, NoCArc '15, Waikiki, HI, USA, December 2015
@inproceedings{abc,
	author = {Onur Mutlu},
	booktitle = {Proceedings of the 8th International Workshop on Network on Chip Architectures, NoCArc {\textquoteright}15, Waikiki, HI, USA},
	title = {Rethinking Memory System Design (along with Interconnects).},
	url = {http://doi.acm.org/10.1145/2835512.2835520},
	year = {2015}
}
2015 International Conference on Parallel Architecture and Compilation, PACT 2015, San Francisco, CA, USA, October 2015
@inproceedings{abc,
	author = {Rachata Ausavarungnirun and Saugata Ghose and Onur Kayiran and Gabriel H. Loh and Chita R. Das and Mahmut T. Kandemir and Onur Mutlu},
	booktitle = {2015 International Conference on Parallel Architecture and Compilation, PACT 2015, San Francisco, CA, USA},
	title = {Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance.},
	url = {http://dx.doi.org/10.1109/PACT.2015.38},
	year = {2015}
}
2015 International Conference on Parallel Architecture and Compilation, PACT 2015, San Francisco, CA, USA, October 2015
@inproceedings{abc,
	author = {Donghyuk Lee and Lavanya Subramanian and Rachata Ausavarungnirun and Jongmoo Choi and Onur Mutlu},
	booktitle = {2015 International Conference on Parallel Architecture and Compilation, PACT 2015, San Francisco, CA, USA},
	title = {Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM.},
	url = {http://dx.doi.org/10.1109/PACT.2015.51},
	year = {2015}
}
Proceedings of the 9th International Symposium on Networks-on-Chip, NOCS 2015, Vancouver, BC, Canada, September 2015
@inproceedings{abc,
	author = {Mohammad Fattah and Antti Airola and Rachata Ausavarungnirun and Nima Mirzaei and Pasi Liljeberg and Juha Plosila and Siamak Mohammadi and Tapio Pahikkala and Onur Mutlu and Hannu Tenhunen},
	booktitle = {Proceedings of the 9th International Symposium on Networks-on-Chip, NOCS 2015, Vancouver, BC, Canada},
	title = {A Low-Overhead, Fully-Distributed, Guaranteed-Delivery Routing Algorithm for Faulty Network-on-Chips.},
	url = {http://doi.acm.org/10.1145/2786572.2786591},
	year = {2015}
}
2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2015, Samos, Greece, July 2015
@inproceedings{abc,
	author = {Onur Mutlu},
	booktitle = {2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2015, Samos, Greece},
	title = {Rethinking memory system design for data-intensive computing.},
	url = {http://dx.doi.org/10.1109/SAMOS.2015.7363650},
	year = {2015}
}
Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 2015
@inproceedings{abc,
	author = {Vivek Seshadri and Gennady Pekhimenko and Olatunji Ruwase and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry and Trishul M. Chilimbi},
	booktitle = {Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA},
	title = {Page overlays: an enhanced virtual memory framework to enable fine-grained memory management.},
	url = {http://doi.acm.org/10.1145/2749469.2750379},
	year = {2015}
}
45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil, June 2015
@inproceedings{abc,
	author = {Justin Meza and Qiang Wu and Sanjeev Kumar and Onur Mutlu},
	booktitle = {45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil},
	title = {Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field.},
	url = {http://dx.doi.org/10.1109/DSN.2015.57},
	year = {2015}
}
Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 2015
@inproceedings{abc,
	author = {Junwhan Ahn and Sungpack Hong and Sungjoo Yoo and Onur Mutlu and Kiyoung Choi},
	booktitle = {Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA},
	title = {A scalable processing-in-memory accelerator for parallel graph processing.},
	url = {http://doi.acm.org/10.1145/2749469.2750386},
	year = {2015}
}
45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil, June 2015
@inproceedings{abc,
	author = {Yu Cai and Yixin Luo and Saugata Ghose and Onur Mutlu},
	booktitle = {45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil},
	title = {Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery.},
	url = {http://dx.doi.org/10.1109/DSN.2015.49},
	year = {2015}
}
45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil, June 2015
@inproceedings{abc,
	author = {Moinuddin K. Qureshi and Dae-Hyun Kim and Samira Manabi Khan and Prashant J. Nair and Onur Mutlu},
	booktitle = {45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil},
	title = {AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems.},
	url = {http://dx.doi.org/10.1109/DSN.2015.58},
	year = {2015}
}
Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Portland, OR, USA, June 2015
@inproceedings{abc,
	author = {Justin Meza and Qiang Wu and Sanjeev Kumar and Onur Mutlu},
	booktitle = {Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Portland, OR, USA},
	title = {A Large-Scale Study of Flash Memory Failures in the Field.},
	url = {http://doi.acm.org/10.1145/2745844.2745848},
	year = {2015}
}
Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 2015
Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available on-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. This paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate “assist warps” that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.
@inproceedings{abc,
	abstract = {Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available on-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. This paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate {\textquotedblleft}assist warps{\textquotedblright} that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7\% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.},
	author = {Nandita Vijaykumar and Gennady Pekhimenko and Adwait Jog and Abhishek Bhowmick and Rachata Ausavarungnirun and Chita R. Das and Mahmut T. Kandemir and Todd C. Mowry and Onur Mutlu},
	booktitle = {Proceedings of the 42nd Annual International Symposium on Computer Architecture},
	title = {A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps.},
	url = {http://doi.acm.org/10.1145/2749469.2750399},
	venue = {Portland, OR, USA},
	year = {2015}
}
Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 2015
@inproceedings{abc,
	author = {Junwhan Ahn and Sungjoo Yoo and Onur Mutlu and Kiyoung Choi},
	booktitle = {Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA},
	title = {PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture.},
	url = {http://doi.acm.org/10.1145/2749469.2750385},
	year = {2015}
}
IEEE 31st Symposium on Mass Storage Systems and Technologies, MSST 2015, Santa Clara, CA, USA, May 2015
@inproceedings{abc,
	author = {Yixin Luo and Yu Cai and Saugata Ghose and Jongmoo Choi and Onur Mutlu},
	booktitle = {IEEE 31st Symposium on Mass Storage Systems and Technologies, MSST 2015, Santa Clara, CA, USA},
	title = {WARM: Improving NAND flash memory lifetime with write-hotness aware retention management.},
	url = {http://dx.doi.org/10.1109/MSST.2015.7208284},
	year = {2015}
}
IEEE 31st Symposium on Mass Storage Systems and Technologies, MSST 2015, Santa Clara, CA, USA, May 2015
@inproceedings{abc,
	author = {Dongwoo Kang and Seungjae Baek and Jongmoo Choi and Donghee Lee and Sam H. Noh and Onur Mutlu},
	booktitle = {IEEE 31st Symposium on Mass Storage Systems and Technologies, MSST 2015, Santa Clara, CA, USA},
	title = {Amnesic cache management for non-volatile memory.},
	url = {http://dx.doi.org/10.1109/MSST.2015.7208291},
	year = {2015}
}
Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, Istanbul, Turkey, March 2015
@inproceedings{abc,
	author = {Hui Wang and Canturk Isci and Lavanya Subramanian and Jongmoo Choi and Depei Qian and Onur Mutlu},
	booktitle = {Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, Istanbul, Turkey},
	title = {A-DRM: Architecture-aware Distributed Resource Management of Virtualized Clusters.},
	url = {http://doi.acm.org/10.1145/2731186.2731202},
	year = {2015}
}
Sixteenth International Symposium on Quality Electronic Design, ISQED 2015, Santa Clara, CA, USA, March 2015
@inproceedings{abc,
	author = {Yu Cai and Ken Mai and Onur Mutlu},
	booktitle = {Sixteenth International Symposium on Quality Electronic Design, ISQED 2015, Santa Clara, CA, USA},
	title = {Comparative evaluation of FPGA and ASIC implementations of bufferless and buffered routing algorithms for on-chip networks.},
	url = {http://dx.doi.org/10.1109/ISQED.2015.7085472},
	year = {2015}
}
21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA, February 2015
@inproceedings{abc,
	author = {Gennady Pekhimenko and Tyler Huberty and Rui Cai and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA},
	title = {Exploiting compressed block size as an indicator of future reuse.},
	url = {http://dx.doi.org/10.1109/HPCA.2015.7056021},
	year = {2015}
}
21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA, February 2015
@inproceedings{abc,
	author = {Yu Cai and Yixin Luo and Erich F. Haratsch and Ken Mai and Onur Mutlu},
	booktitle = {21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA},
	title = {Data retention in MLC NAND flash memory: Characterization, optimization, and recovery.},
	url = {http://dx.doi.org/10.1109/HPCA.2015.7056062},
	year = {2015}
}
21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA, February 2015
@inproceedings{abc,
	author = {Donghyuk Lee and Yoongu Kim and Gennady Pekhimenko and Samira Manabi Khan and Vivek Seshadri and Kevin Kai-Wei Chang and Onur Mutlu},
	booktitle = {21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA},
	title = {Adaptive-latency DRAM: Optimizing DRAM timing for the common-case.},
	url = {http://dx.doi.org/10.1109/HPCA.2015.7056057},
	year = {2015}
}
CoRR, January 2015
@article{abc,
	author = {Hiroyuki Usui and Lavanya Subramanian and Kevin Kai-Wei Chang and Onur Mutlu},
	journal = {CoRR},
	title = {SQUASH: Simple QoS-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators.},
	url = {http://arxiv.org/abs/1505.07502},
	year = {2015}
}
Computer Architecture Letters, January 2015
@article{abc,
	author = {Gennady Pekhimenko and Evgeny Bolotin and Mike O{\textquoteright}Connor and Onur Mutlu and Todd C. Mowry and Stephen W. Keckler},
	journal = {Computer Architecture Letters},
	title = {Toggle-Aware Compression for GPUs.},
	url = {http://dx.doi.org/10.1109/LCA.2015.2430853},
	year = {2015}
}
CoRR, January 2015
@article{abc,
	author = {Hongyi Xin and Richard Zhu and Sunny Nahar and John Emmons and Gennady Pekhimenko and Carl Kingsford and Can Alkan and Onur Mutlu},
	journal = {CoRR},
	title = {Optimal Seed Solver: Optimizing Seed Selection in Read Mapping.},
	url = {http://arxiv.org/abs/1506.08235},
	year = {2015}
}
Computer Architecture Letters, January 2015
@article{abc,
	author = {Vivek Seshadri and Kevin Hsieh and Amirali Boroumand and Donghyuk Lee and Michael A. Kozuch and Onur Mutlu and Phillip B. Gibbons and Todd C. Mowry},
	journal = {Computer Architecture Letters},
	title = {Fast Bulk Bitwise AND and OR in DRAM.},
	url = {http://dx.doi.org/10.1109/LCA.2015.2434872},
	year = {2015}
}
CoRR, January 2015
@article{abc,
	author = {Donghyuk Lee and Gennady Pekhimenko and Samira Manabi Khan and Saugata Ghose and Onur Mutlu},
	journal = {CoRR},
	title = {Simultaneous Multi Layer Access: A High Bandwidth and Low Cost 3D-Stacked Memory Interface.},
	url = {http://arxiv.org/abs/1506.03160},
	year = {2015}
}
IEEE Micro, January 2015
@article{abc,
	author = {Onur Mutlu and Richard A. Belgard},
	journal = {IEEE Micro},
	title = {Introducing the MICRO Test of Time Awards: Concept, Process, 2014 Winners, and the Future.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MM.2015.32},
	year = {2015}
}
CoRR, January 2015
@article{abc,
	author = {Yang Li and Jongmoo Choi and Jin Sun and Saugata Ghose and Hui Wang and Justin Meza and Jinglei Ren and Onur Mutlu},
	journal = {CoRR},
	title = {Managing Hybrid Main Memories with a Page-Utility Driven Performance Model.},
	url = {http://arxiv.org/abs/1507.03303},
	year = {2015}
}
CoRR, January 2015
@article{abc,
	author = {Lavanya Subramanian and Donghyuk Lee and Vivek Seshadri and Harsha Rastogi and Onur Mutlu},
	journal = {CoRR},
	title = {The Blacklisting Memory Scheduler: Balancing Performance, Fairness and Complexity.},
	url = {http://arxiv.org/abs/1504.00390},
	year = {2015}
}
Bioinformatics, January 2015
@article{abc,
	author = {Hongyi Xin and John Greth and John Emmons and Gennady Pekhimenko and Carl Kingsford and Can Alkan and Onur Mutlu},
	journal = {Bioinformatics},
	title = {Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping.},
	url = {http://dx.doi.org/10.1093/bioinformatics/btu856},
	year = {2015}
}
IEEE Trans. Computers, January 2015
@article{abc,
	author = {Youyou Lu and Jiwu Shu and Jia Guo and Shuai Li and Onur Mutlu},
	journal = {IEEE Trans. Computers},
	title = {High-Performance and Lightweight Transaction Support in Flash-Based SSDs.},
	url = {http://dx.doi.org/10.1109/TC.2015.2389828},
	year = {2015}
}

2014

47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom, December 2014
@inproceedings{abc,
	author = {Jishen Zhao and Onur Mutlu and Yuan Xie},
	booktitle = {47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom},
	title = {FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems.},
	url = {http://dx.doi.org/10.1109/MICRO.2014.47},
	year = {2014}
}
47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom, December 2014
@inproceedings{abc,
	author = {Onur Kayiran and Nachiappan Chidambaram Nachiappan and Adwait Jog and Rachata Ausavarungnirun and Mahmut T. Kandemir and Gabriel H. Loh and Onur Mutlu and Chita R. Das},
	booktitle = {47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom},
	title = {Managing GPU Concurrency in Heterogeneous Architectures.},
	url = {http://dx.doi.org/10.1109/MICRO.2014.62},
	year = {2014}
}
32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea, October 2014
@inproceedings{abc,
	author = {Youyou Lu and Jiwu Shu and Long Sun and Onur Mutlu},
	booktitle = {32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea},
	title = {Loose-Ordering Consistency for persistent memory.},
	url = {http://dx.doi.org/10.1109/ICCD.2014.6974684},
	year = {2014}
}
32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea, October 2014
@inproceedings{abc,
	author = {Chris Fallin and Chris Wilkerson and Onur Mutlu},
	booktitle = {32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea},
	title = {The heterogeneous block architecture.},
	url = {http://dx.doi.org/10.1109/ICCD.2014.6974710},
	year = {2014}
}
32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea, October 2014
@inproceedings{abc,
	author = {Lavanya Subramanian and Donghyuk Lee and Vivek Seshadri and Harsha Rastogi and Onur Mutlu},
	booktitle = {32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea},
	title = {The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost.},
	url = {http://dx.doi.org/10.1109/ICCD.2014.6974655},
	year = {2014}
}
26th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2014, Paris, France, October 2014
@inproceedings{abc,
	author = {Rachata Ausavarungnirun and Chris Fallin and Xiangyao Yu and Kevin Kai-Wei Chang and Greg Nazario and Reetuparna Das and Gabriel H. Loh and Onur Mutlu},
	booktitle = {26th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2014, Paris, France},
	title = {Design and Evaluation of Hierarchical Rings with Deflection Routing.},
	url = {http://dx.doi.org/10.1109/SBAC-PAD.2014.31},
	year = {2014}
}
International Conference on Parallel Architectures and Compilation, PACT '14, Edmonton, AB, Canada, August 2014
@inproceedings{abc,
	author = {James A. Jablin and Thomas B. Jablin and Onur Mutlu and Maurice Herlihy},
	booktitle = {International Conference on Parallel Architectures and Compilation, PACT {\textquoteright}14, Edmonton, AB, Canada},
	title = {Warp-aware trace scheduling for GPUs.},
	url = {http://doi.acm.org/10.1145/2628071.2628101},
	year = {2014}
}
International Conference on Parallel Architectures and Compilation, PACT '14, Edmonton, AB, Canada, August 2014
@inproceedings{abc,
	author = {Bradley Thwaites and Gennady Pekhimenko and Hadi Esmaeilzadeh and Amir Yazdanbakhsh and Onur Mutlu and Jongse Park and Girish Mururu and Todd C. Mowry},
	booktitle = {International Conference on Parallel Architectures and Compilation, PACT {\textquoteright}14, Edmonton, AB, Canada},
	title = {Rollback-free value prediction with approximate loads.},
	url = {http://doi.acm.org/10.1145/2628071.2628110},
	year = {2014}
}
ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '14, Austin, TX, June 2014
@inproceedings{abc,
	author = {Samira Manabi Khan and Donghyuk Lee and Yoongu Kim and Alaa R. Alameldeen and Chris Wilkerson and Onur Mutlu},
	booktitle = {ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS {\textquoteright}14, Austin, TX},
	title = {The efficacy of error mitigation techniques for DRAM retention failures: a comparative experimental study.},
	url = {http://doi.acm.org/10.1145/2591971.2592000},
	year = {2014}
}
ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '14, Austin, TX, June 2014
@inproceedings{abc,
	author = {Yu Cai and Gulay Yalcin and Onur Mutlu and Erich F. Haratsch and Osman S. Unsal and Adri{\'a}n Cristal and Ken Mai},
	booktitle = {ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS {\textquoteright}14, Austin, TX},
	title = {Neighbor-cell assisted error correction for MLC NAND flash memories.},
	url = {http://doi.acm.org/10.1145/2591971.2591994},
	year = {2014}
}
ACM/IEEE 41st International Symposium on Computer Architecture, ISCA 2014, Minneapolis, MN, USA, June 2014
@inproceedings{abc,
	author = {Yoongu Kim and Ross Daly and Jeremie Kim and Chris Fallin and Ji-Hye Lee and Donghyuk Lee and Chris Wilkerson and Konrad Lai and Onur Mutlu},
	booktitle = {ACM/IEEE 41st International Symposium on Computer Architecture, ISCA 2014, Minneapolis, MN, USA},
	title = {Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors.},
	url = {http://dx.doi.org/10.1109/ISCA.2014.6853210},
	year = {2014}
}
ACM/IEEE 41st International Symposium on Computer Architecture, ISCA 2014, Minneapolis, MN, USA, June 2014
@inproceedings{abc,
	author = {Vivek Seshadri and Abhishek Bhowmick and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {ACM/IEEE 41st International Symposium on Computer Architecture, ISCA 2014, Minneapolis, MN, USA},
	title = {The Dirty-Block Index.},
	url = {http://dx.doi.org/10.1109/ISCA.2014.6853204},
	year = {2014}
}
44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2014, Atlanta, GA, USA, June 2014
@inproceedings{abc,
	author = {Yixin Luo and Sriram Govindan and Bikash Sharma and Mark Santaniello and Justin Meza and Aman Kansal and Jie Liu and Badriddine M. Khessib and Kushagra Vaid and Onur Mutlu},
	booktitle = {44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2014, Atlanta, GA, USA},
	title = {Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory.},
	url = {http://dx.doi.org/10.1109/DSN.2014.50},
	year = {2014}
}
20th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS 2014, Berlin, Germany, April 2014
@inproceedings{abc,
	author = {Hyoseung Kim and Dionisio de Niz and Bj{\"o}rn Andersson and Mark H. Klein and Onur Mutlu and Ragunathan Rajkumar},
	booktitle = {20th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS 2014, Berlin, Germany},
	title = {Bounding memory interference delay in COTS-based multi-core systems.},
	url = {http://dx.doi.org/10.1109/RTAS.2014.6925998},
	year = {2014}
}
20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014, Orlando, FL, USA, February 2014
@inproceedings{abc,
	author = {Kevin Kai-Wei Chang and Donghyuk Lee and Zeshan Chishti and Alaa R. Alameldeen and Chris Wilkerson and Yoongu Kim and Onur Mutlu},
	booktitle = {20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014, Orlando, FL, USA},
	title = {Improving DRAM performance by parallelizing refreshes with accesses.},
	url = {http://dx.doi.org/10.1109/HPCA.2014.6835946},
	year = {2014}
}
20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014, Orlando, FL, USA, February 2014
@inproceedings{abc,
	author = {Samira Manabi Khan and Alaa R. Alameldeen and Chris Wilkerson and Onur Mutlu and Daniel A. Jim{\'e}nez},
	booktitle = {20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014, Orlando, FL, USA},
	title = {Improving cache performance using read-write partitioning.},
	url = {http://dx.doi.org/10.1109/HPCA.2014.6835954},
	year = {2014}
}
TACO, January 2014
@article{abc,
	author = {Vivek Seshadri and Samihan Yedkar and Hongyi Xin and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	journal = {TACO},
	title = {Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks.},
	url = {http://doi.acm.org/10.1145/2677956},
	year = {2014}
}
TACO, January 2014
@article{abc,
	author = {HanBin Yoon and Justin Meza and Naveen Muralimanohar and Norman P. Jouppi and Onur Mutlu},
	journal = {TACO},
	title = {Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories.},
	url = {http://doi.acm.org/10.1145/2669365},
	year = {2014}
}

2013

The 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, Davis, CA, USA, December 2013
@inproceedings{abc,
	author = {Vivek Seshadri and Yoongu Kim and Chris Fallin and Donghyuk Lee and Rachata Ausavarungnirun and Gennady Pekhimenko and Yixin Luo and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {The 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, Davis, CA, USA},
	title = {RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization.},
	url = {http://doi.acm.org/10.1145/2540708.2540725},
	year = {2013}
}
The 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, Davis, CA, USA, December 2013
@inproceedings{abc,
	author = {Gennady Pekhimenko and Vivek Seshadri and Yoongu Kim and Hongyi Xin and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {The 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, Davis, CA, USA},
	title = {Linearly compressed pages: a low-complexity, low-latency main memory compression framework.},
	url = {http://doi.acm.org/10.1145/2540708.2540724},
	year = {2013}
}
2013 IEEE 31st International Conference on Computer Design, ICCD 2013, Asheville, NC, USA, October 2013
@inproceedings{abc,
	author = {Youyou Lu and Jiwu Shu and Jia Guo and Shuai Li and Onur Mutlu},
	booktitle = {2013 IEEE 31st International Conference on Computer Design, ICCD 2013, Asheville, NC, USA},
	title = {LightTx: A lightweight transactional design in flash-based SSDs to support flexible transactions.},
	url = {http://dx.doi.org/10.1109/ICCD.2013.6657033},
	year = {2013}
}
2013 IEEE 31st International Conference on Computer Design, ICCD 2013, Asheville, NC, USA, October 2013
@inproceedings{abc,
	author = {Yu Cai and Onur Mutlu and Erich F. Haratsch and Ken Mai},
	booktitle = {2013 IEEE 31st International Conference on Computer Design, ICCD 2013, Asheville, NC, USA},
	title = {Program interference in MLC NAND flash memory: Characterization, modeling, and mitigation.},
	url = {http://dx.doi.org/10.1109/ICCD.2013.6657034},
	year = {2013}
}
The 40th Annual International Symposium on Computer Architecture, ISCA'13, Tel-Aviv, Israel, June 2013
@inproceedings{abc,
	author = {Adwait Jog and Onur Kayiran and Asit K. Mishra and Mahmut T. Kandemir and Onur Mutlu and Ravishankar Iyer and Chita R. Das},
	booktitle = {The 40th Annual International Symposium on Computer Architecture, ISCA{\textquoteright}13, Tel-Aviv, Israel},
	title = {Orchestrated scheduling and prefetching for GPGPUs.},
	url = {http://doi.acm.org/10.1145/2485922.2485951},
	year = {2013}
}
The 40th Annual International Symposium on Computer Architecture, ISCA'13, Tel-Aviv, Israel, June 2013
@inproceedings{abc,
	author = {Jamie Liu and Ben Jaiyen and Yoongu Kim and Chris Wilkerson and Onur Mutlu},
	booktitle = {The 40th Annual International Symposium on Computer Architecture, ISCA{\textquoteright}13, Tel-Aviv, Israel},
	title = {An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms.},
	url = {http://doi.acm.org/10.1145/2485922.2485928},
	year = {2013}
}
The 40th Annual International Symposium on Computer Architecture, ISCA'13, Tel-Aviv, Israel, June 2013
@inproceedings{abc,
	author = {Jos{\'e} A. Joao and M. Aater Suleman and Onur Mutlu and Yale N. Patt},
	booktitle = {The 40th Annual International Symposium on Computer Architecture, ISCA{\textquoteright}13, Tel-Aviv, Israel},
	title = {Utility-based acceleration of multithreaded applications on asymmetric CMPs.},
	url = {http://doi.acm.org/10.1145/2485922.2485936},
	year = {2013}
}
The 50th Annual Design Automation Conference 2013, DAC '13, Austin, TX, USA, May 2013
@inproceedings{abc,
	author = {Asit K. Mishra and Onur Mutlu and Chita R. Das},
	booktitle = {The 50th Annual Design Automation Conference 2013, DAC {\textquoteright}13, Austin, TX, USA},
	title = {A heterogeneous multiple network-on-chip design: an application-aware approach.},
	url = {http://doi.acm.org/10.1145/2463209.2488779},
	year = {2013}
}
2013 IEEE International Symposium on Performance Analysis of Systems and Software, Austin, TX, USA, April 2013
@inproceedings{abc,
	author = {Emre Kultursay and Mahmut T. Kandemir and Anand Sivasubramaniam and Onur Mutlu},
	booktitle = {2013 IEEE International Symposium on Performance Analysis of Systems and Software, Austin, TX, USA},
	title = {Evaluating STT-RAM as an energy-efficient main memory alternative.},
	url = {http://dx.doi.org/10.1109/ISPASS.2013.6557176},
	year = {2013}
}
2013 IEEE International Symposium on Performance Analysis of Systems and Software, Austin, TX, USA, April 2013
@inproceedings{abc,
	author = {Chuanjun Zhang and Glenn G. Ko and Jungwook Choi and Shang-nien Tsai and Minje Kim and Abner Guzm{\'a}n-Rivera and Rob A. Rutenbar and Paris Smaragdis and Mi Sun Park and Narayanan Vijaykrishnan and Hongyi Xin and Onur Mutlu and Bin Li and Li Zhao and Mei Chen},
	booktitle = {2013 IEEE International Symposium on Performance Analysis of Systems and Software, Austin, TX, USA},
	title = {EMERALD: Characterization of emerging applications and algorithms for low-power devices.},
	url = {http://dx.doi.org/10.1109/ISPASS.2013.6557154},
	year = {2013}
}
Design, Automation and Test in Europe, DATE 13, Grenoble, France, March 2013
@inproceedings{abc,
	author = {Yu Cai and Erich F. Haratsch and Onur Mutlu and Ken Mai},
	booktitle = {Design, Automation and Test in Europe, DATE 13, Grenoble, France},
	title = {Threshold voltage distribution in MLC NAND flash memory: characterization, analysis, and modeling.},
	year = {2013}
}
Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, Houston, TX, March 2013
@inproceedings{abc,
	author = {Adwait Jog and Onur Kayiran and Nachiappan Chidambaram Nachiappan and Asit K. Mishra and Mahmut T. Kandemir and Onur Mutlu and Ravishankar Iyer and Chita R. Das},
	booktitle = {Architectural Support for Programming Languages and Operating Systems, ASPLOS {\textquoteright}13, Houston, TX},
	title = {OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance.},
	url = {http://doi.acm.org/10.1145/2451116.2451158},
	year = {2013}
}
19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China, February 2013
@inproceedings{abc,
	author = {Reetuparna Das and Rachata Ausavarungnirun and Onur Mutlu and Akhilesh Kumar and Mani Azimi},
	booktitle = {19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China},
	title = {Application-to-core mapping policies to reduce memory system interference in multi-core systems.},
	url = {http://dx.doi.org/10.1109/HPCA.2013.6522311},
	year = {2013}
}
19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China, February 2013
@inproceedings{abc,
	author = {Donghyuk Lee and Yoongu Kim and Vivek Seshadri and Jamie Liu and Lavanya Subramanian and Onur Mutlu},
	booktitle = {19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China},
	title = {Tiered-latency DRAM: A low latency and low cost DRAM architecture.},
	url = {http://dx.doi.org/10.1109/HPCA.2013.6522354},
	year = {2013}
}
19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China, February 2013
@inproceedings{abc,
	author = {Lavanya Subramanian and Vivek Seshadri and Yoongu Kim and Ben Jaiyen and Onur Mutlu},
	booktitle = {19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China},
	title = {MISE: Providing performance predictability and improving fairness in shared main memory systems.},
	url = {http://dx.doi.org/10.1109/HPCA.2013.6522356},
	year = {2013}
}
BMC Genomics, January 2013
@article{abc,
	author = {Hongyi Xin and Donghyuk Lee and Farhad Hormozdiari and Samihan Yedkar and Onur Mutlu and Can Alkan},
	journal = {BMC Genomics},
	title = {Accelerating read mapping with FastHASH.},
	url = {http://dx.doi.org/10.1186/1471-2164-14-S1-S13},
	year = {2013}
}

2012

IEEE 24th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2012, New York, NY, USA, October 2012
@inproceedings{abc,
	author = {Kevin Kai-Wei Chang and Rachata Ausavarungnirun and Chris Fallin and Onur Mutlu},
	booktitle = {IEEE 24th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2012, New York, NY, USA},
	title = {HAT: Heterogeneous Adaptive Throttling for On-Chip Networks.},
	url = {http://doi.ieeecomputersociety.org/10.1109/SBAC-PAD.2012.44},
	year = {2012}
}
International Conference on Parallel Architectures and Compilation Techniques, PACT '12, Minneapolis, MN, September 2012
@inproceedings{abc,
	author = {Gennady Pekhimenko and Todd C. Mowry and Onur Mutlu},
	booktitle = {International Conference on Parallel Architectures and Compilation Techniques, PACT {\textquoteright}12, Minneapolis, MN},
	title = {Linearly compressed pages: a main memory compression framework with low complexity and low latency.},
	url = {http://doi.acm.org/10.1145/2370816.2370911},
	year = {2012}
}
International Conference on Parallel Architectures and Compilation Techniques, PACT '12, Minneapolis, MN, September 2012
@inproceedings{abc,
	author = {Vivek Seshadri and Onur Mutlu and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {International Conference on Parallel Architectures and Compilation Techniques, PACT {\textquoteright}12, Minneapolis, MN},
	title = {The evicted-address filter: a unified mechanism to address both cache pollution and thrashing.},
	url = {http://doi.acm.org/10.1145/2370816.2370868},
	year = {2012}
}
International Conference on Parallel Architectures and Compilation Techniques, PACT '12, Minneapolis, MN, September 2012
@inproceedings{abc,
	author = {Gennady Pekhimenko and Vivek Seshadri and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {International Conference on Parallel Architectures and Compilation Techniques, PACT {\textquoteright}12, Minneapolis, MN},
	title = {Base-delta-immediate compression: practical data compression for on-chip caches.},
	url = {http://doi.acm.org/10.1145/2370816.2370870},
	year = {2012}
}
30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada, September 2012
@inproceedings{abc,
	author = {Yu Cai and Gulay Yalcin and Onur Mutlu and Erich F. Haratsch and Adri{\'a}n Cristal and Osman S. Unsal and Ken Mai},
	booktitle = {30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada},
	title = {Flash correct-and-refresh: Retention-aware error management for increased flash memory lifetime.},
	url = {http://doi.ieeecomputersociety.org/10.1109/ICCD.2012.6378623},
	year = {2012}
}
30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada, September 2012
@inproceedings{abc,
	author = {Justin Meza and Jing Li and Onur Mutlu},
	booktitle = {30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada},
	title = {A case for small row buffers in non-volatile main memories.},
	url = {http://dx.doi.org/10.1109/ICCD.2012.6378685},
	year = {2012}
}
30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada, September 2012
@inproceedings{abc,
	author = {HanBin Yoon and Justin Meza and Rachata Ausavarungnirun and Rachael Harding and Onur Mutlu},
	booktitle = {30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada},
	title = {Row buffer locality aware caching policies for hybrid memories.},
	url = {http://dx.doi.org/10.1109/ICCD.2012.6378661},
	year = {2012}
}
International Conference on Parallel Architectures and Compilation Techniques, PACT '12, Minneapolis, MN, September 2012
@inproceedings{abc,
	author = {Reetuparna Das and Rachata Ausavarungnirun and Onur Mutlu and Akhilesh Kumar and Mani Azimi},
	booktitle = {International Conference on Parallel Architectures and Compilation Techniques, PACT {\textquoteright}12, Minneapolis, MN},
	title = {Application-to-core mapping policies to reduce memory interference in multi-core systems.},
	url = {http://doi.acm.org/10.1145/2370816.2370893},
	year = {2012}
}
International Conference on Parallel Architectures and Compilation Techniques, PACT '12, Minneapolis, MN, September 2012
@inproceedings{abc,
	author = {Nachiappan Chidambaram Nachiappan and Asit K. Mishra and Mahmut T. Kandemir and Anand Sivasubramaniam and Onur Mutlu and Chita R. Das},
	booktitle = {International Conference on Parallel Architectures and Compilation Techniques, PACT {\textquoteright}12, Minneapolis, MN},
	title = {Application-aware prefetch prioritization in on-chip networks.},
	url = {http://doi.acm.org/10.1145/2370816.2370886},
	year = {2012}
}
ACM SIGCOMM 2012 Conference, SIGCOMM '12, Helsinki, August 2012
@inproceedings{abc,
	author = {George Nychis and Chris Fallin and Thomas Moscibroda and Onur Mutlu and Srinivasan Seshan},
	booktitle = {ACM SIGCOMM 2012 Conference, SIGCOMM {\textquoteright}12, Helsinki},
	title = {On-chip networks from a networking perspective: congestion and scalability in many-core interconnects.},
	url = {http://doi.acm.org/10.1145/2342356.2342436},
	year = {2012}
}
39th International Symposium on Computer Architecture (ISCA 2012), Portland, OR, USA, June 2012
@inproceedings{abc,
	author = {Jamie Liu and Ben Jaiyen and Richard Veras and Onur Mutlu},
	booktitle = {39th International Symposium on Computer Architecture (ISCA 2012)},
	title = {RAIDR: Retention-aware intelligent DRAM refresh.},
	url = {http://dx.doi.org/10.1109/ISCA.2012.6237001},
	venue = {Portland, OR, USA},
	year = {2012}
}
39th International Symposium on Computer Architecture (ISCA 2012), Portland, OR, USA, June 2012
@inproceedings{abc,
	author = {Yoongu Kim and Vivek Seshadri and Donghyuk Lee and Jamie Liu and Onur Mutlu},
	booktitle = {39th International Symposium on Computer Architecture (ISCA 2012)},
	title = {A case for exploiting subarray-level parallelism (SALP) in DRAM.},
	url = {http://dx.doi.org/10.1109/ISCA.2012.6237032},
	venue = {Portland, OR, USA},
	year = {2012}
}
39th International Symposium on Computer Architecture (ISCA 2012), Portland, OR, USA, June 2012
@inproceedings{abc,
	author = {Rachata Ausavarungnirun and Kevin Kai-Wei Chang and Lavanya Subramanian and Gabriel H. Loh and Onur Mutlu},
	booktitle = {39th International Symposium on Computer Architecture (ISCA 2012)},
	title = {Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems.},
	url = {http://dx.doi.org/10.1109/ISCA.2012.6237036},
	venue = {Portland, OR, USA},
	year = {2012}
}
2012 Sixth IEEE/ACM International Symposium on Networks-on-Chip (NoCS), Copenhagen, Denmark, May 2012
@inproceedings{abc,
	author = {Chris Fallin and Greg Nazario and Xiangyao Yu and Kevin Kai-Wei Chang and Rachata Ausavarungnirun and Onur Mutlu},
	booktitle = {2012 Sixth IEEE/ACM International Symposium on Networks-on-Chip (NoCS), Copenhagen, Denmark},
	title = {MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect.},
	url = {http://dx.doi.org/10.1109/NOCS.2012.8},
	year = {2012}
}
Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2012, London, UK, March 2012
@inproceedings{abc,
	author = {Jos{\'e} A. Joao and M. Aater Suleman and Onur Mutlu and Yale N. Patt},
	booktitle = {Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2012, London, UK},
	title = {Bottleneck identification and scheduling in multithreaded applications.},
	url = {http://doi.acm.org/10.1145/2150976.2151001},
	year = {2012}
}
2012 Design, Automation & Test in Europe Conference & Exhibition, DATE 2012, Dresden, Germany, March 2012
@inproceedings{abc,
	author = {Yu Cai and Erich F. Haratsch and Onur Mutlu and Ken Mai},
	booktitle = {2012 Design, Automation \& Test in Europe Conference \& Exhibition, DATE 2012, Dresden, Germany},
	title = {Error patterns in MLC NAND flash memory: Measurement, characterization, and analysis.},
	url = {http://dx.doi.org/10.1109/DATE.2012.6176524},
	year = {2012}
}
ACM Trans. Comput. Syst., January 2012
@article{abc,
	author = {Eiman Ebrahimi and Chang Joo Lee and Onur Mutlu and Yale N. Patt},
	journal = {ACM Trans. Comput. Syst.},
	title = {Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multicore Memory Systems.},
	url = {http://doi.acm.org/10.1145/2166879.2166881},
	year = {2012}
}
Computer Architecture Letters, January 2012
@article{abc,
	author = {Justin Meza and Jichuan Chang and HanBin Yoon and Onur Mutlu and Parthasarathy Ranganathan},
	journal = {Computer Architecture Letters},
	title = {Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management.},
	url = {http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.2},
	year = {2012}
}
IEEE Micro, January 2012
@article{abc,
	author = {Boris Grot and Joel Hestness and Stephen W. Keckler and Onur Mutlu},
	journal = {IEEE Micro},
	title = {A QoS-Enabled On-Die Interconnect Fabric for Kilo-Node Chips.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MM.2012.18},
	year = {2012}
}

2011

44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2011, Porto Alegre, Brazil, December 2011
@inproceedings{abc,
	author = {Eiman Ebrahimi and Rustam Miftakhutdinov and Chris Fallin and Chang Joo Lee and Jos{\'e} A. Joao and Onur Mutlu and Yale N. Patt},
	booktitle = {44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2011, Porto Alegre, Brazil},
	title = {Parallel application memory scheduling.},
	url = {http://doi.acm.org/10.1145/2155620.2155663},
	year = {2011}
}
44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2011, Porto Alegre, Brazil, December 2011
@inproceedings{abc,
	author = {Veynu Narasiman and Michael Shebanow and Chang Joo Lee and Rustam Miftakhutdinov and Onur Mutlu and Yale N. Patt},
	booktitle = {44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2011, Porto Alegre, Brazil},
	title = {Improving GPU performance via large warps and two-level warp scheduling.},
	url = {http://doi.acm.org/10.1145/2155620.2155656},
	year = {2011}
}
44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2011, Porto Alegre, Brazil, December 2011
@inproceedings{abc,
	author = {Sai Prashanth Muralidhara and Lavanya Subramanian and Onur Mutlu and Mahmut T. Kandemir and Thomas Moscibroda},
	booktitle = {44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2011, Porto Alegre, Brazil},
	title = {Reducing memory interference in multicore systems via application-aware memory channel partitioning.},
	url = {http://doi.acm.org/10.1145/2155620.2155664},
	year = {2011}
}
Proceedings of the 10th International Symposium on Memory Management, ISMM 2011, San Jose, CA, USA, June 2011
@inproceedings{abc,
	author = {Onur Mutlu},
	booktitle = {Proceedings of the 10th International Symposium on Memory Management, ISMM 2011, San Jose, CA, USA},
	title = {Memory systems in the many-core era: challenges, opportunities, and solution directions.},
	url = {http://doi.acm.org/10.1145/1993478.1993489},
	year = {2011}
}
38th International Symposium on Computer Architecture (ISCA 2011), San Jose, CA, USA, June 2011
@inproceedings{abc,
	author = {Eiman Ebrahimi and Chang Joo Lee and Onur Mutlu and Yale N. Patt},
	booktitle = {38th International Symposium on Computer Architecture (ISCA 2011)},
	title = {Prefetch-aware shared resource management for multi-core systems.},
	url = {http://doi.acm.org/10.1145/2000064.2000081},
	venue = {San Jose, CA, USA},
	year = {2011}
}
38th International Symposium on Computer Architecture (ISCA 2011), San Jose, CA, USA, June 2011
@inproceedings{abc,
	author = {Boris Grot and Joel Hestness and Stephen W. Keckler and Onur Mutlu},
	booktitle = {38th International Symposium on Computer Architecture (ISCA 2011)},
	title = {Kilo-NOC: a heterogeneous network-on-chip architecture for scalability and service guarantees.},
	url = {http://doi.acm.org/10.1145/2000064.2000112},
	venue = {San Jose, CA, USA},
	year = {2011}
}
Proceedings of the 8th International Conference on Autonomic Computing, ICAC 2011, Karlsruhe, Germany, June 2011
@inproceedings{abc,
	author = {Howard David and Chris Fallin and Eugene Gorbatov and Ulf R. Hanebutte and Onur Mutlu},
	booktitle = {Proceedings of the 8th International Conference on Autonomic Computing, ICAC 2011, Karlsruhe, Germany},
	title = {Memory power management via dynamic voltage/frequency scaling.},
	url = {http://doi.acm.org/10.1145/1998582.1998590},
	year = {2011}
}
Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 2011
@inproceedings{abc,
	author = {Licheng Chen and Yongbing Huang and Yungang Bao and Onur Mutlu and Guangming Tan and Mingyu Chen},
	booktitle = {Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA},
	title = {Poster: revisiting virtual channel memory for performance and fairness on multi-core architecture.},
	url = {http://doi.acm.org/10.1145/1995896.1995962},
	year = {2011}
}
NOCS 2011, Fifth ACM/IEEE International Symposium on Networks-on-Chip, Pittsburgh, Pennsylvania, USA, May 2011
@inproceedings{abc,
	author = {Michael Papamichael and James C. Hoe and Onur Mutlu},
	booktitle = {NOCS 2011, Fifth ACM/IEEE International Symposium on Networks-on-Chip, Pittsburgh, Pennsylvania, USA},
	title = {FIST: A fast, lightweight, FPGA-friendly packet latency estimator for NoC modeling in full-system simulations.},
	year = {2011}
}
17th International Conference on High-Performance Computer Architecture (HPCA-17 2011), San Antonio, Texas, USA, February 2011
@inproceedings{abc,
	author = {Chris Fallin and Chris Craik and Onur Mutlu},
	booktitle = {17th International Conference on High-Performance Computer Architecture (HPCA-17 2011)},
	title = {CHIPPER: A low-complexity bufferless deflection router.},
	url = {http://dx.doi.org/10.1109/HPCA.2011.5749724},
	venue = {San Antonio, Texas, USA},
	year = {2011}
}
IEEE Micro, January 2011
@article{abc,
	author = {Yale N. Patt and Onur Mutlu},
	journal = {IEEE Micro},
	title = {Top Picks [Guest editors{\textquoteright} introduction].},
	url = {http://dx.doi.org/10.1109/MM.2011.16},
	year = {2011}
}
IEEE Trans. Computers, January 2011
@article{abc,
	author = {Chang Joo Lee and Onur Mutlu and Veynu Narasiman and Yale N. Patt},
	journal = {IEEE Trans. Computers},
	title = {Prefetch-Aware Memory Controllers.},
	url = {http://doi.ieeecomputersociety.org/10.1109/TC.2010.214},
	year = {2011}
}
IEEE Micro, January 2011
@article{abc,
	author = {M. Aater Suleman and Onur Mutlu and Jos{\'e} A. Joao and Khubaib and Yale N. Patt},
	journal = {IEEE Micro},
	title = {Data Marshaling for Multicore Systems.},
	url = {http://dx.doi.org/10.1109/MM.2010.105},
	year = {2011}
}
IEEE Micro, January 2011
@article{abc,
	author = {Yoongu Kim and Michael Papamichael and Onur Mutlu and Mor Harchol-Balter},
	journal = {IEEE Micro},
	title = {Thread Cluster Memory Scheduling.},
	url = {http://dx.doi.org/10.1109/MM.2011.15},
	year = {2011}
}
IEEE Micro, January 2011
@article{abc,
	author = {Reetuparna Das and Onur Mutlu and Thomas Moscibroda and Chita R. Das},
	journal = {IEEE Micro},
	title = {A{\'e}rgia: A Network-on-Chip Exploiting Packet Latency Slack.},
	url = {http://dx.doi.org/10.1109/MM.2010.98},
	year = {2011}
}

2010

43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2010, Atlanta, Georgia, USA, December 2010
@inproceedings{abc,
	author = {Yoongu Kim and Michael Papamichael and Onur Mutlu and Mor Harchol-Balter},
	booktitle = {43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2010},
	title = {Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior.},
	url = {http://dx.doi.org/10.1109/MICRO.2010.51},
	venue = {Atlanta, Georgia, USA},
	year = {2010}
}
Proceedings of the 9th ACM Workshop on Hot Topics in Networks, HotNets 2010, Monterey, CA, October 2010
@inproceedings{abc,
	author = {George Nychis and Chris Fallin and Thomas Moscibroda and Onur Mutlu},
	booktitle = {Proceedings of the 9th ACM Workshop on Hot Topics in Networks, HotNets 2010, Monterey, CA},
	title = {Next generation on-chip networks: what kind of congestion control do we need?},
	url = {http://doi.acm.org/10.1145/1868447.1868459},
	year = {2010}
}
19th International Conference on Parallel Architecture and Compilation Techniques, PACT 2010, Vienna, Austria, September 2010
@inproceedings{abc,
	author = {Tanaus{\'u} Ram{\'\i}rez and Alex Pajuelo and Oliverio J. Santana and Onur Mutlu and Mateo Valero},
	booktitle = {19th International Conference on Parallel Architecture and Compilation Techniques, PACT 2010, Vienna, Austria},
	title = {Efficient runahead threads.},
	url = {http://doi.acm.org/10.1145/1854273.1854328},
	year = {2010}
}
Computer Architecture - ISCA 2010 International Workshops A4MMC, AMAS-BT, EAMA, WEED, WIOSCA, Saint-Malo, France, Revised Selected Papers, June 2010
@inproceedings{abc,
	author = {Boris Grot and Stephen W. Keckler and Onur Mutlu},
	booktitle = {Computer Architecture - ISCA 2010 International Workshops A4MMC, AMAS-BT, EAMA, WEED, WIOSCA, Saint-Malo, France},
	title = {Topology-Aware Quality-of-Service Support in Highly Integrated Chip Multiprocessors.},
	url = {http://dx.doi.org/10.1007/978-3-642-24322-6_28},
	venue = {Revised Selected Papers},
	year = {2010}
}
37th International Symposium on Computer Architecture (ISCA 2010), Saint-Malo, France, June 2010
@inproceedings{abc,
	author = {Reetuparna Das and Onur Mutlu and Thomas Moscibroda and Chita R. Das},
	booktitle = {37th International Symposium on Computer Architecture (ISCA 2010)},
	title = {A{\'e}rgia: exploiting packet latency slack in on-chip networks.},
	url = {http://doi.acm.org/10.1145/1815961.1815976},
	venue = {Saint-Malo, France},
	year = {2010}
}
37th International Symposium on Computer Architecture (ISCA 2010), Saint-Malo, France, June 2010
@inproceedings{abc,
	author = {M. Aater Suleman and Onur Mutlu and Jos{\'e} A. Joao and Khubaib and Yale N. Patt},
	booktitle = {37th International Symposium on Computer Architecture (ISCA 2010)},
	title = {Data marshaling for multi-core architectures.},
	url = {http://doi.acm.org/10.1145/1815961.1816020},
	venue = {Saint-Malo, France},
	year = {2010}
}
NOCS 2010, Fourth ACM/IEEE International Symposium on Networks-on-Chip, Grenoble, France, May 2010
@inproceedings{abc,
	author = {Paul Bogdan and Miray Kas and Radu Marculescu and Onur Mutlu},
	booktitle = {NOCS 2010, Fourth ACM/IEEE International Symposium on Networks-on-Chip, Grenoble, France},
	title = {QuaLe: A Quantum-Leap Inspired Model for Non-stationary Analysis of NoC Traffic in Chip Multi-processors.},
	url = {http://doi.ieeecomputersociety.org/10.1109/NOCS.2010.34},
	year = {2010}
}
28th IEEE VLSI Test Symposium, VTS 2010, Santa Cruz, California, USA, April 2010
@inproceedings{abc,
	author = {Yanjing Li and Onur Mutlu and Donald S. Gardner and Subhasish Mitra},
	booktitle = {28th IEEE VLSI Test Symposium, VTS 2010},
	title = {Concurrent autonomous self-test for uncore components in system-on-chips.},
	url = {http://dx.doi.org/10.1109/VTS.2010.5469571},
	venue = {Santa Cruz, California, USA},
	year = {2010}
}
16th International Conference on High-Performance Computer Architecture (HPCA-16 2010), Bangalore, India, January 2010
@inproceedings{abc,
	author = {Yoongu Kim and Dongsu Han and Onur Mutlu and Mor Harchol-Balter},
	booktitle = {16th International Conference on High-Performance Computer Architecture (HPCA-16 2010)},
	title = {ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers.},
	url = {http://dx.doi.org/10.1109/HPCA.2010.5416658},
	venue = {Bangalore, India},
	year = {2010}
}
Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2010, Pittsburgh, Pennsylvania, USA, January 2010
@inproceedings{abc,
	author = {Eiman Ebrahimi and Chang Joo Lee and Onur Mutlu and Yale N. Patt},
	booktitle = {Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2010, Pittsburgh, Pennsylvania, USA},
	title = {Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems.},
	url = {http://doi.acm.org/10.1145/1736020.1736058},
	year = {2010}
}
IEEE Micro, January 2010
@article{abc,
	author = {M. Aater Suleman and Onur Mutlu and Moinuddin K. Qureshi and Yale N. Patt},
	journal = {IEEE Micro},
	title = {Accelerating Critical Section Execution with Asymmetric Multicore Architectures.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MM.2010.7},
	year = {2010}
}
IEEE Micro, January 2010
@article{abc,
	author = {Benjamin C. Lee and Ping Zhou and Jun Yang and Youtao Zhang and Bo Zhao and Engin Ipek and Onur Mutlu and Doug Burger},
	journal = {IEEE Micro},
	title = {Phase-Change Technology and the Future of Main Memory.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MM.2010.24},
	year = {2010}
}
Commun. ACM, January 2010
@article{abc,
	author = {Benjamin C. Lee and Engin Ipek and Onur Mutlu and Doug Burger},
	journal = {Commun. ACM},
	title = {Phase change memory architecture and the quest for scalability.},
	url = {http://doi.acm.org/10.1145/1785414.1785441},
	year = {2010}
}

2009

2009 International Conference on Computer-Aided Design, ICCAD 2009, San Jose, CA, USA, November 2009
@inproceedings{abc,
	author = {Yanjing Li and Onur Mutlu and Subhasish Mitra},
	booktitle = {2009 International Conference on Computer-Aided Design, ICCAD 2009, San Jose, CA, USA},
	title = {Operating system scheduling for efficient online self-test in robust systems.},
	year = {2009}
}
42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009), New York, New York, USA, January 2009
@inproceedings{abc,
	author = {Eiman Ebrahimi and Onur Mutlu and Chang Joo Lee and Yale N. Patt},
	booktitle = {42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009)},
	title = {Coordinated control of multiple prefetchers in multi-core systems.},
	url = {http://doi.acm.org/10.1145/1669112.1669154},
	venue = {New York, New York, USA},
	year = {2009}
}
42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009), New York, New York, USA, January 2009
@inproceedings{abc,
	author = {Chang Joo Lee and Veynu Narasiman and Onur Mutlu and Yale N. Patt},
	booktitle = {42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009)},
	title = {Improving memory bank-level parallelism in the presence of prefetching.},
	url = {http://doi.acm.org/10.1145/1669112.1669155},
	venue = {New York, New York, USA},
	year = {2009}
}
Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2009, Washington, DC, USA, January 2009
@inproceedings{abc,
	author = {M. Aater Suleman and Onur Mutlu and Moinuddin K. Qureshi and Yale N. Patt},
	booktitle = {Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2009, Washington, DC, USA},
	title = {Accelerating critical section execution with asymmetric multi-core architectures.},
	url = {http://doi.acm.org/10.1145/1508244.1508274},
	year = {2009}
}
42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009), New York, New York, USA, January 2009
@inproceedings{abc,
	author = {Reetuparna Das and Onur Mutlu and Thomas Moscibroda and Chita R. Das},
	booktitle = {42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009)},
	title = {Application-aware prioritization mechanisms for on-chip networks.},
	url = {http://doi.acm.org/10.1145/1669112.1669150},
	venue = {New York, New York, USA},
	year = {2009}
}
15th International Conference on High-Performance Computer Architecture (HPCA-15 2009), Raleigh, North Carolina, USA, January 2009
@inproceedings{abc,
	author = {Boris Grot and Joel Hestness and Stephen W. Keckler and Onur Mutlu},
	booktitle = {15th International Conference on High-Performance Computer Architecture (HPCA-15 2009)},
	title = {Express Cube Topologies for on-Chip Interconnects.},
	url = {http://dx.doi.org/10.1109/HPCA.2009.4798251},
	venue = {Raleigh, North Carolina, USA},
	year = {2009}
}
42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009), New York, New York, USA, January 2009
@inproceedings{abc,
	author = {Boris Grot and Stephen W. Keckler and Onur Mutlu},
	booktitle = {42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009)},
	title = {Preemptive virtual clock: a flexible, efficient, and cost-effective QOS scheme for networks-on-chip.},
	url = {http://doi.acm.org/10.1145/1669112.1669149},
	venue = {New York, New York, USA},
	year = {2009}
}
15th International Conference on High-Performance Computer Architecture (HPCA-15 2009), Raleigh, North Carolina, USA, January 2009
@inproceedings{abc,
	author = {Eiman Ebrahimi and Onur Mutlu and Yale N. Patt},
	booktitle = {15th International Conference on High-Performance Computer Architecture (HPCA-15 2009)},
	title = {Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems.},
	url = {http://dx.doi.org/10.1109/HPCA.2009.4798232},
	venue = {Raleigh, North Carolina, USA},
	year = {2009}
}
36th International Symposium on Computer Architecture (ISCA 2009), Austin, TX, USA, January 2009
@inproceedings{abc,
	author = {Benjamin C. Lee and Engin Ipek and Onur Mutlu and Doug Burger},
	booktitle = {36th International Symposium on Computer Architecture (ISCA 2009)},
	title = {Architecting phase change memory as a scalable dram alternative.},
	url = {http://doi.acm.org/10.1145/1555754.1555758},
	venue = {Austin, TX, USA},
	year = {2009}
}
36th International Symposium on Computer Architecture (ISCA 2009), Austin, TX, USA, January 2009
@inproceedings{abc,
	author = {Jos{\'e} A. Joao and Onur Mutlu and Yale N. Patt},
	booktitle = {36th International Symposium on Computer Architecture (ISCA 2009)},
	title = {Flexible reference-counting-based hardware acceleration for garbage collection.},
	url = {http://doi.acm.org/10.1145/1555754.1555806},
	venue = {Austin, TX, USA},
	year = {2009}
}
36th International Symposium on Computer Architecture (ISCA 2009), Austin, TX, USA, January 2009
@inproceedings{abc,
	author = {Thomas Moscibroda and Onur Mutlu},
	booktitle = {36th International Symposium on Computer Architecture (ISCA 2009)},
	title = {A case for bufferless routing in on-chip networks.},
	url = {http://doi.acm.org/10.1145/1555754.1555781},
	venue = {Austin, TX, USA},
	year = {2009}
}
IEEE Trans. Computers, January 2009
@article{abc,
	author = {Hyesoon Kim and Jos{\'e} A. Joao and Onur Mutlu and Chang Joo Lee and Yale N. Patt and Robert S. Cohn},
	journal = {IEEE Trans. Computers},
	title = {Virtual Program Counter (VPC) Prediction: Very Low Cost Indirect Branch Prediction Using Conditional Branch Prediction Hardware.},
	url = {http://doi.ieeecomputersociety.org/10.1109/TC.2008.227},
	year = {2009}
}
IEEE Trans. Computers, January 2009
@article{abc,
	author = {Kypros Constantinides and Onur Mutlu and Todd M. Austin and Valeria Bertacco},
	journal = {IEEE Trans. Computers},
	title = {A Flexible Software-Based Framework for Online Detection of Hardware Defects.},
	url = {http://doi.ieeecomputersociety.org/10.1109/TC.2009.52},
	year = {2009}
}
IEEE Micro, January 2009
@article{abc,
	author = {Onur Mutlu and Thomas Moscibroda},
	journal = {IEEE Micro},
	title = {Parallelism-Aware Batch Scheduling: Enabling High-Performance and Fair Shared Memory Controllers.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MM.2009.12},
	year = {2009}
}

2008

Proceedings of the Twenty-Seventh Annual ACM Symposium on Principles of Distributed Computing, PODC 2008, Toronto, Canada, January 2008
@inproceedings{abc,
	author = {Thomas Moscibroda and Onur Mutlu},
	booktitle = {Proceedings of the Twenty-Seventh Annual ACM Symposium on Principles of Distributed Computing, PODC 2008, Toronto, Canada},
	title = {Distributed order scheduling and its application to multi-core DRAM controllers.},
	url = {http://doi.acm.org/10.1145/1400751.1400799},
	year = {2008}
}
IEEE Micro, January 2008
@article{abc,
	author = {Sangyeun Cho and Tao Li and Onur Mutlu},
	journal = {IEEE Micro},
	title = {Guest Editors{\textquoteright} Introduction: Interaction of Many-Core Computer Architecture and Operating Systems.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MM.2008.39},
	year = {2008}
}
Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2008, Seattle, WA, USA, January 2008
@inproceedings{abc,
	author = {Jos{\'e} A. Joao and Onur Mutlu and Hyesoon Kim and Rishi Agarwal and Yale N. Patt},
	booktitle = {Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2008, Seattle, WA, USA},
	title = {Improving the performance of object-oriented languages with dynamic predication of indirect jumps.},
	url = {http://doi.acm.org/10.1145/1346281.1346293},
	year = {2008}
}
14th International Conference on High-Performance Computer Architecture (HPCA-14 2008), Salt Lake City, UT, USA, January 2008
@inproceedings{abc,
	author = {Chang Joo Lee and Hyesoon Kim and Onur Mutlu and Yale N. Patt},
	booktitle = {14th International Conference on High-Performance Computer Architecture (HPCA-14 2008)},
	title = {Performance-aware speculation control using wrong path usefulness prediction.},
	url = {http://dx.doi.org/10.1109/HPCA.2008.4658626},
	venue = {Salt Lake City, UT, USA},
	year = {2008}
}
35th International Symposium on Computer Architecture (ISCA 2008), Beijing, China, January 2008
@inproceedings{abc,
	author = {Onur Mutlu and Thomas Moscibroda},
	booktitle = {35th International Symposium on Computer Architecture (ISCA 2008)},
	title = {Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems.},
	url = {http://dx.doi.org/10.1109/ISCA.2008.7},
	venue = {Beijing, China},
	year = {2008}
}
35th International Symposium on Computer Architecture (ISCA 2008), Beijing, China, January 2008
@inproceedings{abc,
	author = {Engin Ipek and Onur Mutlu and Jos{\'e} F. Mart{\'\i}nez and Rich Caruana},
	booktitle = {35th International Symposium on Computer Architecture (ISCA 2008)},
	title = {Self-Optimizing Memory Controllers: A Reinforcement Learning Approach.},
	url = {http://dx.doi.org/10.1109/ISCA.2008.21},
	venue = {Beijing, China},
	year = {2008}
}
41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41 2008), Lake Como, Italy, January 2008
@inproceedings{abc,
	author = {Kypros Constantinides and Onur Mutlu and Todd M. Austin},
	booktitle = {41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41 2008)},
	title = {Online design bug detection: RTL analysis, flexible mechanisms, and evaluation.},
	url = {http://dx.doi.org/10.1109/MICRO.2008.4771798},
	venue = {Lake Como, Italy},
	year = {2008}
}
41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41 2008), Lake Como, Italy, January 2008
@inproceedings{abc,
	author = {Chang Joo Lee and Onur Mutlu and Veynu Narasiman and Yale N. Patt},
	booktitle = {41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41 2008)},
	title = {Prefetch-Aware DRAM Controllers.},
	url = {http://dx.doi.org/10.1109/MICRO.2008.4771791},
	venue = {Lake Como, Italy},
	year = {2008}
}

2007

Proceedings of the 16th USENIX Security Symposium, Boston, MA, USA, August 2007
@inproceedings{abc,
	author = {Thomas Moscibroda and Onur Mutlu},
	booktitle = {Proceedings of the 16th USENIX Security Symposium, Boston, MA, USA},
	title = {Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems.},
	url = {https://www.usenix.org/conference/16th-usenix-security-symposium/memory-performance-attacks-denial-memory-service-multi},
	year = {2007}
}
Computer Architecture Letters, January 2007
@article{abc,
	author = {Jos{\'e} A. Joao and Onur Mutlu and Hyesoon Kim and Yale N. Patt},
	journal = {Computer Architecture Letters},
	title = {Dynamic Predication of Indirect Jumps.},
	url = {http://dx.doi.org/10.1109/L-CA.2007.7},
	year = {2007}
}
IEEE Micro, January 2007
@article{abc,
	author = {Hyesoon Kim and Jos{\'e} A. Joao and Onur Mutlu and Yale N. Patt},
	journal = {IEEE Micro},
	title = {Diverge-Merge Processor: Generalized and Energy-Efficient Dynamic Predication.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MM.2007.9},
	year = {2007}
}
Computer Architecture Letters, January 2007
@article{abc,
	author = {Jos{\'e} A. Joao and Onur Mutlu and Hyesoon Kim and Yale N. Patt},
	journal = {Computer Architecture Letters},
	title = {Dynamic Predication of Indirect Jumps.},
	url = {http://dx.doi.org/10.1109/L-CA.2008.2},
	year = {2007}
}
Fifth International Symposium on Code Generation and Optimization (CGO 2007), San Jose, California, USA, January 2007
@inproceedings{abc,
	author = {Hyesoon Kim and Jos{\'e} A. Joao and Onur Mutlu and Yale N. Patt},
	booktitle = {Fifth International Symposium on Code Generation and Optimization (CGO 2007)},
	title = {Profile-assisted Compiler Support for Dynamic Predication in Diverge-Merge Processors.},
	url = {http://doi.ieeecomputersociety.org/10.1109/CGO.2007.31},
	venue = {San Jose, California, USA},
	year = {2007}
}
13th International Conference on High-Performance Computer Architecture (HPCA-13 2007), Phoenix, Arizona, USA, January 2007
@inproceedings{abc,
	author = {Santhosh Srinath and Onur Mutlu and Hyesoon Kim and Yale N. Patt},
	booktitle = {13th International Conference on High-Performance Computer Architecture (HPCA-13 2007)},
	title = {Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers.},
	url = {http://doi.ieeecomputersociety.org/10.1109/HPCA.2007.346185},
	venue = {Phoenix, Arizona, USA},
	year = {2007}
}
34th International Symposium on Computer Architecture (ISCA 2007), San Diego, California, USA, January 2007
@inproceedings{abc,
	author = {Hyesoon Kim and Jos{\'e} A. Joao and Onur Mutlu and Chang Joo Lee and Yale N. Patt and Robert S. Cohn},
	booktitle = {34th International Symposium on Computer Architecture (ISCA 2007)},
	title = {VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization.},
	url = {http://doi.acm.org/10.1145/1250662.1250715},
	venue = {San Diego, California, USA},
	year = {2007}
}
40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-40 2007), Chicago, Illinois, USA, January 2007
@inproceedings{abc,
	author = {Kypros Constantinides and Onur Mutlu and Todd M. Austin and Valeria Bertacco},
	booktitle = {40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-40 2007)},
	title = {Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MICRO.2007.39},
	venue = {Chicago, Illinois, USA},
	year = {2007}
}
40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-40 2007), Chicago, Illinois, USA, January 2007
@inproceedings{abc,
	author = {Onur Mutlu and Thomas Moscibroda},
	booktitle = {40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-40 2007)},
	title = {Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MICRO.2007.40},
	venue = {Chicago, Illinois, USA},
	year = {2007}
}

2006

Fourth IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2006), New York, New York, USA, January 2006
@inproceedings{abc,
	author = {Hyesoon Kim and M. Aater Suleman and Onur Mutlu and Yale N. Patt},
	booktitle = {Fourth IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2006)},
	title = {2D-Profiling: Detecting Input-Dependent Branches with a Single Input Data Set.},
	url = {http://doi.ieeecomputersociety.org/10.1109/CGO.2006.1},
	venue = {New York, New York, USA},
	year = {2006}
}
33rd International Symposium on Computer Architecture (ISCA 2006), Boston, MA, USA, January 2006
@inproceedings{abc,
	author = {Moinuddin K. Qureshi and Daniel N. Lynch and Onur Mutlu and Yale N. Patt},
	booktitle = {33rd International Symposium on Computer Architecture (ISCA 2006)},
	title = {A Case for MLP-Aware Cache Replacement.},
	url = {http://doi.ieeecomputersociety.org/10.1109/ISCA.2006.5},
	venue = {Boston, MA, USA},
	year = {2006}
}
39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39 2006), Orlando, Florida, USA, January 2006
@inproceedings{abc,
	author = {Hyesoon Kim and Jos{\'e} A. Joao and Onur Mutlu and Yale N. Patt},
	booktitle = {39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39 2006)},
	title = {Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MICRO.2006.20},
	venue = {Orlando, Florida, USA},
	year = {2006}
}
IEEE Micro, January 2006
@article{abc,
	author = {Hyesoon Kim and Onur Mutlu and Yale N. Patt and Jared Stark},
	journal = {IEEE Micro},
	title = {Wish Branches: Enabling Adaptive and Aggressive Predicated Execution.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MM.2006.27},
	year = {2006}
}
IEEE Micro, January 2006
@article{abc,
	author = {Onur Mutlu and Hyesoon Kim and Yale N. Patt},
	journal = {IEEE Micro},
	title = {Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MM.2006.10},
	year = {2006}
}
IEEE Trans. Computers, January 2006
@article{abc,
	author = {Onur Mutlu and Hyesoon Kim and Yale N. Patt},
	journal = {IEEE Trans. Computers},
	title = {Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses.},
	url = {http://doi.ieeecomputersociety.org/10.1109/TC.2006.191},
	year = {2006}
}

2005

2005 International Conference on Dependable Systems and Networks (DSN 2005), January 2005
@inproceedings{abc,
	author = {Moinuddin K. Qureshi and Onur Mutlu and Yale N. Patt},
	booktitle = {2005 International Conference on Dependable Systems and Networks (DSN 2005)},
	title = {Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors.},
	url = {http://doi.ieeecomputersociety.org/10.1109/DSN.2005.62},
	year = {2005}
}
32nd International Symposium on Computer Architecture (ISCA 2005), Madison, Wisconsin, USA, January 2005
@inproceedings{abc,
	author = {Onur Mutlu and Hyesoon Kim and Yale N. Patt},
	booktitle = {32nd International Symposium on Computer Architecture (ISCA 2005)},
	title = {Techniques for Efficient Processing in Runahead Execution Engines.},
	url = {http://csdl.computer.org/comp/proceedings/isca/2005/2270/00/22700370abs.htm},
	venue = {Madison, Wisconsin, USA},
	year = {2005}
}
38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-38 2005), Barcelona, Spain, January 2005
@inproceedings{abc,
	author = {Hyesoon Kim and Onur Mutlu and Jared Stark and Yale N. Patt},
	booktitle = {38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-38 2005)},
	title = {Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MICRO.2005.38},
	venue = {Barcelona, Spain},
	year = {2005}
}
38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-38 2005), Barcelona, Spain, January 2005
@inproceedings{abc,
	author = {Onur Mutlu and Hyesoon Kim and Yale N. Patt},
	booktitle = {38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-38 2005)},
	title = {Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MICRO.2005.11},
	venue = {Barcelona, Spain},
	year = {2005}
}
Computer Architecture Letters, January 2005
@article{abc,
	author = {Onur Mutlu and Hyesoon Kim and Jared Stark and Yale N. Patt},
	journal = {Computer Architecture Letters},
	title = {On Reusing the Results of Pre-Executed Instructions in a Runahead Execution Processor.},
	url = {http://dx.doi.org/10.1109/L-CA.2005.1},
	year = {2005}
}
International Journal of Parallel Programming, January 2005
@article{abc,
	author = {Onur Mutlu and Hyesoon Kim and David N. Armstrong and Yale N. Patt},
	journal = {International Journal of Parallel Programming},
	title = {Using the First-Level Caches as Filters to Reduce the Pollution Caused by Speculative Memory References.},
	url = {http://dx.doi.org/10.1007/s10766-005-7304-x},
	year = {2005}
}
IEEE Trans. Computers, January 2005
@article{abc,
	author = {Onur Mutlu and Hyesoon Kim and David N. Armstrong and Yale N. Patt},
	journal = {IEEE Trans. Computers},
	title = {An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors.},
	url = {http://doi.ieeecomputersociety.org/10.1109/TC.2005.190},
	year = {2005}
}

2004

37th Annual International Symposium on Microarchitecture (MICRO-37 2004), Portland, OR, USA, January 2004
@inproceedings{abc,
	author = {David N. Armstrong and Hyesoon Kim and Onur Mutlu and Yale N. Patt},
	booktitle = {37th Annual International Symposium on Microarchitecture (MICRO-37 2004)},
	title = {Wrong Path Events: Exploiting Unusual and Illegal Program Behavior for Early Misprediction Detection and Recovery.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MICRO.2004.38},
	venue = {Portland, OR, USA},
	year = {2004}
}
16th Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2004), Foz do Iguacu, Brazil, January 2004
@inproceedings{abc,
	author = {Onur Mutlu and Hyesoon Kim and David N. Armstrong and Yale N. Patt},
	booktitle = {16th Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2004)},
	title = {Cache Filtering Techniques to Reduce the Negative Impact of Useless Speculative Memory References on Processor Performance.},
	url = {http://csdl.computer.org/comp/proceedings/sbac-pad/2004/2240/00/22400002abs.htm},
	venue = {Foz do Iguacu, Brazil},
	year = {2004}
}
Proceedings of the 3rd Workshop on Memory Performance Issues, in conjunction with the 31st International Symposium on Computer Architecture 2004, Munich, Germany, January 2004
@inproceedings{abc,
	author = {Onur Mutlu and Hyesoon Kim and David N. Armstrong and Yale N. Patt},
	booktitle = {Proceedings of the 3rd Workshop on Memory Performance Issues, in conjunction with the 31st International Symposium on Computer Architecture 2004, Munich, Germany},
	title = {Understanding the effects of wrong-path memory references on processor performance.},
	url = {http://doi.acm.org/10.1145/1054943.1054951},
	year = {2004}
}

2003

IEEE Micro, January 2003
@article{abc,
	author = {Onur Mutlu and Jared Stark and Chris Wilkerson and Yale N. Patt},
	journal = {IEEE Micro},
	title = {Runahead Execution: An Effective Alternative to Large Instruction Windows.},
	url = {http://csdl.computer.org/comp/mags/mi/2003/06/m6020abs.htm},
	year = {2003}
}
HPCA, January 2003
@inproceedings{abc,
	author = {Onur Mutlu and Jared Stark and Chris Wilkerson and Yale N. Patt},
	booktitle = {HPCA},
	title = {Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors.},
	url = {http://computer.org/proceedings/hpca/1871/18710129abs.htm},
	year = {2003}
}