Publications by Onur Mutlu | Publications - Systems Group, ETH Zurich

2018

LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching

Mohammad Sadrosadati, Amirhossein Mirhosseini, Seyed B. Ehsani, Hamid Sarbazi-Azad, Mario Drumond, Babak Falsafi, Rachata Ausavarungnirun, and Onur Mutlu

Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018

Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp’s aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.

@inproceedings{sadrosadati2018ltrf,
	abstract = {Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache.
In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp{\textquoteright}s aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8{\texttimes} larger capacity and improving overall GPU performance by 31\% while reducing register file power consumption by 46\%.},
	author = {Mohammad Sadrosadati and Amirhossein Mirhosseini and Seyed B. Ehsani and Hamid Sarbazi-Azad and Mario Drumond and Babak Falsafi and Rachata Ausavarungnirun and Onur Mutlu},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching},
	url = {https://dl.acm.org/citation.cfm?id=3173211},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}

SPECTR: Formal Supervisory Control and Coordination for Many-core Systems Resource Management

Amir M. Rahmani, Bryan Donyanavard, Tiago Mück, Kasra Moazzemi, Axel Jantsch, Onur Mutlu, and Nikil Dutt

Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018

Resource management strategies for many-core systems need to enable sharing of resources such as power, processing cores, and memory bandwidth while coordinating the priority and significance of system- and application-level objectives at runtime in a scalable and robust manner. State-of-the-art approaches use heuristics or machine learning for resource management, but unfortunately lack formalism in providing robustness against unexpected corner cases. While recent efforts deploy classical control-theoretic approaches with some guarantees and formalism, they lack scalability and autonomy to meet changing runtime goals. We present SPECTR, a new resource management approach for many-core systems that leverages formal supervisory control theory (SCT) to combine the strengths of classical control theory with state-of-the-art heuristic approaches to efficiently meet changing runtime goals. SPECTR is a scalable and robust control architecture and a systematic design flow for hierarchical control of many-core systems. SPECTR leverages SCT techniques such as gain scheduling to allow autonomy for individual controllers. It facilitates automatic synthesis of the high-level supervisory controller and its property verification. We implement SPECTR on an Exynos platform containing ARM’s big.LITTLE-based heterogeneous multi-processor (HMP) and demonstrate that SPECTR’s use of SCT is key to managing multiple interacting resources (e.g., chip power and processing cores) in the presence of competing objectives (e.g., satisfying QoS vs. power capping). The principles of SPECTR are easily applicable to any resource type and objective as long as the management problem can be modeled using dynamical systems theory (e.g., difference equations), discrete-event dynamic systems, or fuzzy dynamics.

@inproceedings{rahmani2018spectr,
	abstract = {Resource management strategies for many-core systems need to enable sharing of resources such as power, processing cores, and memory bandwidth while coordinating the priority and significance of system- and application-level objectives at runtime in a scalable and robust manner. State-of-the-art approaches use heuristics or machine learning for resource management, but unfortunately lack formalism in providing robustness against unexpected corner cases. While recent efforts deploy classical control-theoretic approaches with some guarantees and formalism, they lack scalability and autonomy to meet changing runtime goals. We present SPECTR, a new resource management approach for many-core systems that leverages formal supervisory control theory (SCT) to combine the strengths of classical control theory with state-of-the-art heuristic approaches to efficiently meet changing runtime goals. SPECTR is a scalable and robust control architecture and a systematic design flow for hierarchical control of many-core systems. SPECTR leverages SCT techniques such as gain scheduling to allow autonomy for individual controllers. It facilitates automatic synthesis of the high-level supervisory controller and its property verification. We implement SPECTR on an Exynos platform containing ARM{\textquoteright}s big.LITTLE-based heterogeneous multi-processor (HMP) and demonstrate that SPECTR{\textquoteright}s use of SCT is key to managing multiple interacting resources (e.g., chip power and processing cores) in the presence of competing objectives (e.g., satisfying QoS vs. power capping). The principles of SPECTR are easily applicable to any resource type and objective as long as the management problem can be modeled using dynamical systems theory (e.g., difference equations), discrete-event dynamic systems, or fuzzy dynamics.},
	author = {Amir M. Rahmani and Bryan Donyanavard and Tiago Mück and Kasra Moazzemi and Axel Jantsch and Onur Mutlu and Nikil Dutt},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {SPECTR: Formal Supervisory Control and Coordination for Many-core Systems Resource Management},
	url = {https://dl.acm.org/citation.cfm?id=3173199},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}

Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Dae-Hyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu

Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018

We are experiencing an explosive growth in the number of consumer devices, including smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. For this class of devices, energy efficiency is a first-class concern due to the limited battery capacity and thermal power budget. We find that data movement is a major contributor to the total system energy and execution time in consumer devices. The energy and performance costs of moving data between the memory system and the compute units are significantly higher than the costs of computation. As a result, addressing data movement is crucial for consumer devices. In this work, we comprehensively analyze the energy and performance impact of data movement for several widely-used Google consumer workloads: (1) the Chrome web browser; (2) TensorFlow Mobile, Google’s machine learning framework; (3) video playback; and (4) video capture, both of which are used in many video services such as YouTube and Google Hangouts. We find that processing-in-memory (PIM) can significantly reduce data movement for all of these workloads, by performing part of the computation close to memory. Each workload contains simple primitives and functions that contribute to a significant amount of the overall data movement. We investigate whether these primitives and functions are feasible to implement using PIM, given the limited area and power constraints of consumer devices. Our analysis shows that offloading these primitives to PIM logic, consisting of either simple cores or specialized accelerators, eliminates a large amount of data movement, and significantly reduces total system energy (by an average of 55.4% across the workloads) and execution time (by an average of 54.2%).

@inproceedings{boroumand2018google,
	abstract = {We are experiencing an explosive growth in the number of consumer devices, including smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. For this class of devices, energy efficiency is a first-class concern due to the limited battery capacity and thermal power budget. We find that data movement is a major contributor to the total system energy and execution time in consumer devices. The energy and performance costs of moving data between the memory system and the compute units are significantly higher than the costs of computation. As a result, addressing data movement is crucial for consumer devices. In this work, we comprehensively analyze the energy and performance impact of data movement for several widely-used Google consumer workloads: (1) the Chrome web browser; (2) TensorFlow Mobile, Google{\textquoteright}s machine learning framework; (3) video playback; and (4) video capture, both of which are used in many video services such as YouTube and Google Hangouts. We find that processing-in-memory (PIM) can significantly reduce data movement for all of these workloads, by performing part of the computation close to memory. Each workload contains simple primitives and functions that contribute to a significant amount of the overall data movement. We investigate whether these primitives and functions are feasible to implement using PIM, given the limited area and power constraints of consumer devices. Our analysis shows that offloading these primitives to PIM logic, consisting of either simple cores or specialized accelerators, eliminates a large amount of data movement, and significantly reduces total system energy (by an average of 55.4\% across the workloads) and execution time (by an average of 54.2\%).},
	author = {Amirali Boroumand and Saugata Ghose and Youngsok Kim and Rachata Ausavarungnirun and Eric Shiu and Rahul Thakur and Dae-Hyun Kim and Aki Kuusela and Allan Knies and Parthasarathy Ranganathan and Onur Mutlu},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}

Slim NoC: A Low-Diameter On-Chip Network Topology for High Energy Efficiency and Scalability

Maciej Besta, Syed M. Hassan, Sudhakar Yalamanchili, Rachata Ausavarungnirun, Onur Mutlu, and Torsten Hoefler

Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018

Emerging chips with hundreds and thousands of cores require networks with unprecedented energy/area efficiency and scalability. To address this, we propose Slim NoC (SN): a new on-chip network design that delivers significant improvements in efficiency and scalability compared to the state-of-the-art. The key idea is to use two concepts from graph and number theory, degree-diameter graphs combined with non-prime finite fields, to enable the smallest number of ports for a given core count. SN is inspired by state-of-the-art off-chip topologies; it identifies and distills their advantages for NoC settings while solving several key issues that lead to significant overheads on-chip. SN provides NoC-specific layouts, which further enhance area/energy efficiency. We show how to augment SN with state-of-the-art router microarchitecture schemes such as Elastic Links, to make the network even more scalable and efficient. Our extensive experimental evaluations show that SN outperforms both traditional low-radix topologies (e.g., meshes and tori) and modern high-radix networks (e.g., various Flattened Butterflies) in area, latency, throughput, and static/dynamic power consumption for both synthetic and real workloads. SN provides a promising direction in scalable and energy-efficient NoC topologies.

@inproceedings{besta2018slimnoc,
	abstract = {Emerging chips with hundreds and thousands of cores require networks with unprecedented energy/area efficiency and scalability. To address this, we propose Slim NoC (SN): a new on-chip network design that delivers significant improvements in efficiency and scalability compared to the state-of-the-art. The key idea is to use two concepts from graph and number theory, degree-diameter graphs combined with non-prime finite fields, to enable the smallest number of ports for a given core count. SN is inspired by state-of-the-art off-chip topologies; it identifies and distills their advantages for NoC settings while solving several key issues that lead to significant overheads on-chip. SN provides NoC-specific layouts, which further enhance area/energy efficiency. We show how to augment SN with state-of-the-art router microarchitecture schemes such as Elastic Links, to make the network even more scalable and efficient. Our extensive experimental evaluations show that SN outperforms both traditional low-radix topologies (e.g., meshes and tori) and modern high-radix networks (e.g., various Flattened Butterflies) in area, latency, throughput, and static/dynamic power consumption for both synthetic and real workloads. SN provides a promising direction in scalable and energy-efficient NoC topologies.},
	author = {Maciej Besta and Syed M. Hassan and Sudhakar Yalamanchili and Rachata Ausavarungnirun and Onur Mutlu and Torsten Hoefler},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {Slim NoC: A Low-Diameter On-Chip Network Topology for High Energy Efficiency and Scalability},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}

MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

Rachata Ausavarungnirun, Vance Miller, Joshua Landgraf, Saugata Ghose, Jayneel Gandhi, Adwait Jog, Christopher Rossbach, and Onur Mutlu

Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018

Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate. We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern.
Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to largely reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8%, improves IPC throughput by 43.4%, and reduces application-level unfairness by 22.4%. MASK’s system throughput is within 23.2% of an ideal GPU system with no address translation overhead.

@inproceedings{ausavarungnirun2018mask,
	abstract = {Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate. We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern.
Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to largely reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8\%, improves IPC throughput by 43.4\%, and reduces application-level unfairness by 22.4\%. MASK{\textquoteright}s system throughput is within 23.2\% of an ideal GPU system with no address translation overhead.},
	author = {Rachata Ausavarungnirun and Vance Miller and Joshua Landgraf and Saugata Ghose and Jayneel Gandhi and Adwait Jog and Christopher Rossbach and Onur Mutlu},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}

HeatWatch: Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature-Awareness

Yixin Luo, Saugata Ghose, Yu Cai, Erich F. Haratsch, and Onur Mutlu

Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA), Vienna, Austria, February 2018

NAND flash memory density continues to scale to keep up with the increasing storage demands of data-intensive applications. Unfortunately, as a result of this scaling, the lifetime of NAND flash memory has been decreasing. Each cell in NAND flash memory can endure only a limited number of writes, due to the damage caused by each program and erase operation on the cell. This damage can be partially repaired on its own during the idle time between program or erase operations (known as the dwell time), via a phenomenon known as the self-recovery effect. Prior works study the self-recovery effect for planar (i.e., 2D) NAND flash memory, and propose to exploit it to improve flash lifetime, by applying high temperature to accelerate self-recovery. However, these findings may not be directly applicable to 3D NAND flash memory, due to significant changes in the design and manufacturing process that are required to enable practical 3D stacking for NAND flash memory. In this paper, we perform the first detailed experimental characterization of the effects of self-recovery and temperature on real, state-of-the-art 3D NAND flash memory devices. We show that these effects influence two major factors of NAND flash memory reliability: (1) retention loss speed (i.e., the speed at which a flash cell leaks charge), and (2) program variation (i.e., the difference in programming speed across flash cells). We find that self-recovery and temperature affect 3D NAND flash memory quite differently than they affect planar NAND flash memory, rendering prior models of self-recovery and temperature ineffective for 3D NAND flash memory. Using our characterization results, we develop a new model for 3D NAND flash memory reliability, which predicts how retention, wearout, self-recovery, and temperature affect raw bit error rates and cell threshold voltages. We show that our model is accurate, with an error of only 4.9%.
Based on our experimental findings and our model, we propose HeatWatch, a new mechanism to improve 3D NAND flash memory reliability. The key idea of HeatWatch is to optimize the read reference voltage, i.e., the voltage applied to the cell during a read operation, by adapting it to the dwell time of the workload and the current operating temperature. HeatWatch (1) efficiently tracks flash memory temperature and dwell time online, (2) sends this information to our reliability model to predict the current voltages of flash cells, and (3) predicts the optimal read reference voltage based on the current cell voltages. Our detailed experimental evaluations show that HeatWatch improves flash lifetime by 3.85× over a baseline that uses a fixed read reference voltage, averaged across 28 real storage workload traces, and comes within 0.9% of the lifetime of an ideal read reference voltage selection mechanism.

@inproceedings{luo2018heatwatch,
	abstract = {NAND flash memory density continues to scale to keep up with the increasing storage demands of data-intensive applications. Unfortunately, as a result of this scaling, the lifetime of NAND flash memory has been decreasing. Each cell in NAND flash memory can endure only a limited number of writes, due to the damage caused by each program and erase operation on the cell. This damage can be partially repaired on its own during the idle time between program or erase operations (known as the dwell time), via a phenomenon known as the self-recovery effect. Prior works study the self-recovery effect for planar (i.e., 2D) NAND flash memory, and propose to exploit it to improve flash lifetime, by applying high temperature to accelerate self-recovery. However, these findings may not be directly applicable to 3D NAND flash memory, due to significant changes in the design and manufacturing process that are required to enable practical 3D stacking for NAND flash memory. In this paper, we perform the first detailed experimental characterization of the effects of self-recovery and temperature on real, state-of-the-art 3D NAND flash memory devices. We show that these effects influence two major factors of NAND flash memory reliability: (1) retention loss speed (i.e., the speed at which a flash cell leaks charge), and (2) program variation (i.e., the difference in programming speed across flash cells). We find that self-recovery and temperature affect 3D NAND flash memory quite differently than they affect planar NAND flash memory, rendering prior models of self-recovery and temperature ineffective for 3D NAND flash memory. Using our characterization results, we develop a new model for 3D NAND flash memory reliability, which predicts how retention, wearout, self-recovery, and temperature affect raw bit error rates and cell threshold voltages. We show that our model is accurate, with an error of only 4.9\%.
Based on our experimental findings and our model, we propose HeatWatch, a new mechanism to improve 3D NAND flash memory reliability. The key idea of HeatWatch is to optimize the read reference voltage, i.e., the voltage applied to the cell during a read operation, by adapting it to the dwell time of the workload and the current operating temperature. HeatWatch (1) efficiently tracks flash memory temperature and dwell time online, (2) sends this information to our reliability model to predict the current voltages of flash cells, and (3) predicts the optimal read reference voltage based on the current cell voltages. Our detailed experimental evaluations show that HeatWatch improves flash lifetime by 3.85{\texttimes} over a baseline that uses a fixed read reference voltage, averaged across 28 real storage workload traces, and comes within 0.9\% of the lifetime of an ideal read reference voltage selection mechanism.},
	author = {Yixin Luo and Saugata Ghose and Yu Cai and Erich F. Haratsch and Onur Mutlu},
	booktitle = {Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA)},
	title = {HeatWatch: Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature-Awareness},
	venue = {Vienna, Austria},
	year = {2018}
}

The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Mod...

Jeremie Kim, Minesh Patel, Hasan Hassan, and Onur Mutlu

Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA), Vienna, Austria, February 2018

Physically Unclonable Functions (PUFs) are commonly used in cryptography to identify devices based on the uniqueness of their physical microstructures. DRAM-based PUFs have numerous advantages over PUF designs that exploit alternative substrates: DRAM is a major component of many modern systems, and a DRAM-based PUF can generate many unique identifiers. However, none of the prior DRAM PUF proposals provide implementations suitable for runtime-accessible PUF evaluation on commodity DRAM devices. Prior DRAM PUFs exhibit unacceptably high latencies, especially at low temperatures (e.g., >125.8s on average for a 64KiB memory segment below 55°C), and they cause high system interference by keeping part of DRAM unavailable during PUF evaluation. In this paper, we introduce the DRAM latency PUF, a new class of fast, reliable DRAM PUFs. The key idea is to reduce DRAM read access latency below the reliable datasheet specifications using software-only system calls. Doing so results in error patterns that reflect the compound effects of manufacturing variations in various DRAM structures (e.g., capacitors, wires, sense amplifiers). Based on a rigorous experimental characterization of 223 modern LPDDR4 DRAM chips, we demonstrate that these error patterns 1) satisfy runtime-accessible PUF requirements, and 2) are quickly generated (i.e., at 88.2ms) irrespective of operating temperature using a real system with no additional hardware modifications. We show that, for a constant DRAM capacity overhead of 64KiB, our implementation of the DRAM latency PUF enables an average (minimum, maximum) PUF evaluation time speedup of 152x (109x, 181x) at 70°C and 1426x (868x, 1783x) at 55°C when compared to a DRAM retention PUF and achieves greater speedups at even lower temperatures.

@inproceedings{abc,
	abstract = {Physically Unclonable Functions (PUFs) are commonly used in cryptography to identify devices based on the uniqueness of their physical microstructures. DRAM-based PUFs have numerous advantages over PUF designs that exploit alternative substrates: DRAM is a major component of many modern systems, and a DRAM-based PUF can generate many unique identifiers. However, none of the prior DRAM PUF proposals provide implementations suitable for runtime-accessible PUF evaluation on commodity DRAM devices. Prior DRAM PUFs exhibit unacceptably high latencies, especially at low temperatures (e.g., >125.8s on average for a 64KiB memory segment below 55{\textdegree}C), and they cause high system interference by keeping part of DRAM unavailable during PUF evaluation. In this paper, we introduce the DRAM latency PUF, a new class of fast, reliable DRAM PUFs. The key idea is to reduce DRAM read access latency below the reliable datasheet specifications using software-only system calls. Doing so results in error patterns that reflect the compound effects of manufacturing variations in various DRAM structures (e.g., capacitors, wires, sense amplifiers). Based on a rigorous experimental characterization of 223 modern LPDDR4 DRAM chips, we demonstrate that these error patterns 1) satisfy runtime-accessible PUF requirements, and 2) are quickly generated (i.e., at 88.2ms) irrespective of operating temperature using a real system with no additional hardware modifications. We show that, for a constant DRAM capacity overhead of 64KiB, our implementation of the DRAM latency PUF enables an average (minimum, maximum) PUF evaluation time speedup of 152x (109x, 181x) at 70{\textdegree}C and 1426x (868x, 1783x) at 55{\textdegree}C when compared to a DRAM retention PUF and achieves greater speedups at even lower temperatures.},
	author = {Jeremie Kim and Minesh Patel and Hasan Hassan and Onur Mutlu},
	booktitle = {Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA)},
	title = {The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Mod...},
	venue = {Vienna, Austria},
	year = {2018}
}

MQSim: a framework for enabling realistic studies of modern multi-queue SSD devices

Arash Tavakkol, Juan Gomez-Luna, Mohammad Sadrosadati, Saugata Ghose, and Onur Mutlu

Proceedings of the 16th USENIX Conference on File and Storage Technologies, Oakland, CA, USA, February 2018

Solid-state drives (SSDs) are used in a wide array of computer systems today, including in datacenters and enterprise servers. As the I/O demands of these systems continue to increase, manufacturers are evolving SSD architectures to keep up with this demand. For example, manufacturers have introduced new high-bandwidth interfaces to replace the conventional SATA host-interface protocol. These new interfaces, such as the NVMe protocol, are designed specifically to enable the high amounts of concurrent I/O bandwidth that SSDs are capable of delivering. While modern SSDs with sophisticated features such as the NVMe protocol are already on the market, existing SSD simulation tools have fallen behind, as they do not capture these new features. We find that state-of-the-art SSD simulators have three shortcomings that prevent them from accurately modeling the performance of real off-the-shelf SSDs. First, these simulators do not model critical features of new protocols (e.g., NVMe), such as their use of multiple application-level queues for requests and the elimination of OS intervention for I/O request processing. Second, these simulators often do not accurately capture the impact of advanced SSD maintenance algorithms (e.g., garbage collection), as they do not properly or quickly emulate steady-state conditions that can significantly change the behavior of these algorithms in real SSDs. Third, these simulators do not capture the full end-to-end latency of I/O requests, which can incorrectly skew the results reported for SSDs that make use of emerging non-volatile memory technologies. By not accurately modeling these three features, existing simulators report results that deviate significantly from real SSD performance. In this work, we introduce a new simulator, called MQSim, that accurately models the performance of both modern SSDs and conventional SATA-based SSDs. 
MQSim faithfully models new high-bandwidth protocol implementations, steady-state SSD conditions, and the full end-to-end latency of requests in modern SSDs. We validate MQSim, showing that it reports performance results that are only 6%-18% apart from the measured actual performance of four real state-of-the-art SSDs. We show that by modeling critical features of modern SSDs, MQSim uncovers several real and important issues that were not captured by existing simulators, such as the performance impact of inter-flow interference. We have released MQSim as an open-source tool, and we hope that it can enable researchers to explore directions in new and different areas.

@inproceedings{abc,
	abstract = {Solid-state drives (SSDs) are used in a wide array of computer systems today, including in datacenters and enterprise servers. As the I/O demands of these systems continue to increase, manufacturers are evolving SSD architectures to keep up with this demand. For example, manufacturers have introduced new high-bandwidth interfaces to replace the conventional SATA host-interface protocol. These new interfaces, such as the NVMe protocol, are designed specifically to enable the high amounts of concurrent I/O bandwidth that SSDs are capable of delivering.

While modern SSDs with sophisticated features such as the NVMe protocol are already on the market, existing SSD simulation tools have fallen behind, as they do not capture these new features. We find that state-of-the-art SSD simulators have three shortcomings that prevent them from accurately modeling the performance of real off-the-shelf SSDs. First, these simulators do not model critical features of new protocols (e.g., NVMe), such as their use of multiple application-level queues for requests and the elimination of OS intervention for I/O request processing. Second, these simulators often do not accurately capture the impact of advanced SSD maintenance algorithms (e.g., garbage collection), as they do not properly or quickly emulate steady-state conditions that can significantly change the behavior of these algorithms in real SSDs. Third, these simulators do not capture the full end-to-end latency of I/O requests, which can incorrectly skew the results reported for SSDs that make use of emerging non-volatile memory technologies. By not accurately modeling these three features, existing simulators report results that deviate significantly from real SSD performance.

In this work, we introduce a new simulator, called MQSim, that accurately models the performance of both modern SSDs and conventional SATA-based SSDs. MQSim faithfully models new high-bandwidth protocol implementations, steady-state SSD conditions, and the full end-to-end latency of requests in modern SSDs. We validate MQSim, showing that it reports performance results that are only 6\%-18\% apart from the measured actual performance of four real state-of-the-art SSDs. We show that by modeling critical features of modern SSDs, MQSim uncovers several real and important issues that were not captured by existing simulators, such as the performance impact of inter-flow interference. We have released MQSim as an open-source tool, and we hope that it can enable researchers to explore directions in new and different areas.},
	author = {Arash Tavakkol and Juan Gomez-Luna and Mohammad Sadrosadati and Saugata Ghose and Onur Mutlu},
	booktitle = {Proceedings of the 16th USENIX Conference on File and Storage Technologies},
	title = {MQSim: a framework for enabling realistic studies of modern multi-queue SSD devices},
	venue = {Oakland, CA, USA},
	year = {2018}
}

2017

GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping

Mohammed Alser, Hasan Hassan, Hongyi Xin, Oguz Ergin, Onur Mutlu, and Can Alkan

Bioinformatics, November 2017

Motivation: High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments, called short reads, that cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and ‘candidate’ locations in that reference genome. The similarity measurement, called alignment, formulated as an approximate string matching problem, is the computational bottleneck because: (i) it is implemented using quadratic-time dynamic programming algorithms and (ii) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper’s execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment algorithms. Results: We propose GateKeeper, a new hardware accelerator that functions as a pre-alignment step that quickly filters out most incorrect candidate locations. GateKeeper is the first design to accelerate pre-alignment using Field-Programmable Gate Arrays (FPGAs), which can perform pre-alignment much faster than software. When implemented on a single FPGA chip, GateKeeper maintains high accuracy (on average >96%) while providing, on average, 90-fold and 130-fold speedup over the state-of-the-art software pre-alignment techniques, Adjacency Filter and Shifted Hamming Distance (SHD), respectively. The addition of GateKeeper as a pre-alignment step can reduce the verification time of the mrFAST mapper by a factor of 10.

@article{abc,
	abstract = {Motivation
High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments, called short reads, that cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and {\textquoteleft}candidate{\textquoteright} locations in that reference genome. The similarity measurement, called alignment, formulated as an approximate string matching problem, is the computational bottleneck because: (i) it is implemented using quadratic-time dynamic programming algorithms and (ii) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper{\textquoteright}s execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment algorithms.

Results
We propose GateKeeper, a new hardware accelerator that functions as a pre-alignment step that quickly filters out most incorrect candidate locations. GateKeeper is the first design to accelerate pre-alignment using Field-Programmable Gate Arrays (FPGAs), which can perform pre-alignment much faster than software. When implemented on a single FPGA chip, GateKeeper maintains high accuracy (on average >96\%) while providing, on average, 90-fold and 130-fold speedup over the state-of-the-art software pre-alignment techniques, Adjacency Filter and Shifted Hamming Distance (SHD), respectively. The addition of GateKeeper as a pre-alignment step can reduce the verification time of the mrFAST mapper by a factor of 10.},
	author = {Mohammed Alser and Hasan Hassan and Hongyi Xin and Oguz Ergin and Onur Mutlu and Can Alkan},
	pages = {3355-3363},
	journal = {Bioinformatics},
	title = {GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping},
	volume = {33},
	year = {2017}
}

Detecting and mitigating data-dependent DRAM failures by exploiting current memory content

Samira Manabi Khan, Chris Wilkerson, Zhe Wang, Alaa R. Alameldeen, Donghyuk Lee, and Onur Mutlu

Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, MA, USA, October 2017

DRAM cells in close proximity can fail depending on the data content in neighboring cells. These failures are called data-dependent failures. Detecting and mitigating these failures online, while the system is running in the field, enables various optimizations that improve reliability, latency, and energy efficiency of the system. For example, a system can improve performance and energy efficiency by using a lower refresh rate for most cells and mitigate the failing cells using higher refresh rates or error correcting codes. All these system optimizations depend on accurately detecting every possible data-dependent failure that could occur with any content in DRAM. Unfortunately, detecting all data-dependent failures requires the knowledge of DRAM internals specific to each DRAM chip. As internal DRAM architecture is not exposed to the system, detecting data-dependent failures at the system-level is a major challenge. In this paper, we decouple the detection and mitigation of data-dependent failures from physical DRAM organization such that it is possible to detect failures without knowledge of DRAM internals. To this end, we propose MEMCON, a memory content-based detection and mitigation mechanism for data-dependent failures in DRAM. MEMCON does not detect every possible data-dependent failure. Instead, it detects and mitigates failures that occur only with the current content in memory while the programs are running in the system. Such a mechanism needs to detect failures whenever there is a write access that changes the content of memory. As detection of failure with a runtime testing has a high overhead, MEMCON selectively initiates a test on a write, only when the time between two consecutive writes to that page (i.e., write interval) is long enough to provide significant benefit by lowering the refresh rate during that interval. 
MEMCON builds upon a simple, practical mechanism that predicts the long write intervals based on our observation that the write intervals in real workloads follow a Pareto distribution: the longer a page remains idle after a write, the longer it is expected to remain idle. Our evaluation shows that compared to a system that uses an aggressive refresh rate, MEMCON reduces refresh operations by 65--74%, leading to a 10%/17%/40% (min) to 12%/22%/50% (max) performance improvement for a single-core and 10%/23%/52% (min) to 17%/29%/65% (max) performance improvement for a 4-core system using 8/16/32 Gb DRAM chips.
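
The Pareto observation in the abstract above ("the longer a page remains idle after a write, the longer it is expected to remain idle") can be checked with a short calculation. The sketch below is only an illustration, not MEMCON's predictor: the function name and the parameters `alpha` (shape) and `x_min` (scale) are hypothetical, and their values are not taken from the paper. For a Pareto-distributed write interval X with shape alpha > 1, the mean remaining idle time after the page has already been idle for t >= x_min is t / (alpha - 1), which grows with t.

```python
def expected_remaining_idle(idle_so_far: float, alpha: float, x_min: float = 1.0) -> float:
    """Mean additional idle time of a Pareto(x_min, alpha) write interval,
    given that the page has already been idle for `idle_so_far`.

    alpha and x_min are illustrative assumptions, not values from the paper.
    """
    if alpha <= 1.0:
        raise ValueError("the mean is finite only for alpha > 1")
    if idle_so_far < x_min:
        # Every interval lasts at least x_min, so use the unconditional mean.
        return alpha * x_min / (alpha - 1.0) - idle_so_far
    # E[X | X > t] = alpha * t / (alpha - 1), so E[X - t | X > t] = t / (alpha - 1).
    return idle_so_far / (alpha - 1.0)


if __name__ == "__main__":
    for t in (1.0, 10.0, 100.0):
        print(t, expected_remaining_idle(t, alpha=1.5))
```

Under this assumption, a page that has been idle for 100 time units is expected to stay idle ten times longer than one that has been idle for 10, which is why a long observed write interval is a reasonable trigger for testing and lowering the refresh rate.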

@inproceedings{abc,
abstract = {DRAM cells in close proximity can fail depending on the data content in neighboring cells. These failures are called data-dependent failures. Detecting and mitigating these failures online, while the system is running in the field, enables various optimizations that improve reliability, latency, and energy efficiency of the system. For example, a system can improve performance and energy efficiency by using a lower refresh rate for most cells and mitigate the failing cells using higher refresh rates or error correcting codes. All these system optimizations depend on accurately detecting every possible data-dependent failure that could occur with any content in DRAM. Unfortunately, detecting all data-dependent failures requires the knowledge of DRAM internals specific to each DRAM chip. As internal DRAM architecture is not exposed to the system, detecting data-dependent failures at the system-level is a major challenge.

In this paper, we decouple the detection and mitigation of data-dependent failures from physical DRAM organization such that it is possible to detect failures without knowledge of DRAM internals. To this end, we propose MEMCON, a memory content-based detection and mitigation mechanism for data-dependent failures in DRAM. MEMCON does not detect every possible data-dependent failure. Instead, it detects and mitigates failures that occur only with the current content in memory while the programs are running in the system. Such a mechanism needs to detect failures whenever there is a write access that changes the content of memory. As detection of failure with a runtime testing has a high overhead, MEMCON selectively initiates a test on a write, only when the time between two consecutive writes to that page (i.e., write interval) is long enough to provide significant benefit by lowering the refresh rate during that interval. MEMCON builds upon a simple, practical mechanism that predicts the long write intervals based on our observation that the write intervals in real workloads follow a Pareto distribution: the longer a page remains idle after a write, the longer it is expected to remain idle.

Our evaluation shows that compared to a system that uses an aggressive refresh rate, MEMCON reduces refresh operations by 65--74\%, leading to a 10\%/17\%/40\% (min) to 12\%/22\%/50\% (max) performance improvement for a single-core and 10\%/23\%/52\% (min) to 17\%/29\%/65\% (max) performance improvement for a 4-core system using 8/16/32 Gb DRAM chips.},
author = {Samira Manabi Khan and Chris Wilkerson and Zhe Wang and Alaa R. Alameldeen and Donghyuk Lee and Onur Mutlu},
booktitle = {Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture},
title = {Detecting and mitigating data-dependent DRAM failures by exploiting current memory content},
venue = {Cambridge, MA, USA},
year = {2017}
}

Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology

Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry

Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, MA, USA, October 2017

Many important applications trigger bulk bitwise operations, i.e., bitwise operations on large bit vectors. In fact, recent works design techniques that exploit fast bulk bitwise operations to accelerate databases (bitmap indices, BitWeaving) and web search (BitFunnel). Unfortunately, in existing architectures, the throughput of bulk bitwise operations is limited by the memory bandwidth available to the processing unit (e.g., CPU, GPU, FPGA, processing-in-memory). To overcome this bottleneck, we propose Ambit, an Accelerator-in-Memory for bulk bitwise operations. Unlike prior works, Ambit exploits the analog operation of DRAM technology to perform bitwise operations completely inside DRAM, thereby exploiting the full internal DRAM bandwidth. Ambit consists of two components. First, simultaneous activation of three DRAM rows that share the same set of sense amplifiers enables the system to perform bitwise AND and OR operations. Second, with modest changes to the sense amplifier, the system can use the inverters present inside the sense amplifier to perform bitwise NOT operations. With these two components, Ambit can perform any bulk bitwise operation efficiently inside DRAM. Ambit largely exploits existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1% of DRAM chip area). Importantly, Ambit uses the modern DRAM interface without any changes, and therefore it can be directly plugged onto the memory bus. Our extensive circuit simulations show that Ambit works as expected even in the presence of significant process variation. Averaged across seven bulk bitwise operations, Ambit improves performance by 32X and reduces energy consumption by 35X compared to state-of-the-art systems. When integrated with Hybrid Memory Cube (HMC), a 3D-stacked DRAM with a logic layer, Ambit improves performance of bulk bitwise operations by 9.7X compared to processing in the logic layer of the HMC. 
Ambit improves the performance of three real-world data-intensive applications, 1) database bitmap indices, 2) BitWeaving, a technique to accelerate database scans, and 3) bit-vector-based implementation of sets, by 3X-7X compared to a state-of-the-art baseline using SIMD optimizations. We describe four other applications that can benefit from Ambit, including a recent technique proposed to speed up web search. We believe that large performance and energy improvements provided by Ambit can enable other applications to use bulk bitwise operations.
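
One way to see how activating three rows at once can yield AND and OR is to model the result as a bitwise majority of the three rows. The sketch below assumes that majority behavior (the abstract above does not spell it out) and models rows as Python integers; `ROW_BITS` and the function names are hypothetical. With the third row preset to all zeros the majority reduces to AND, and with all ones it reduces to OR.

```python
ROW_BITS = 16                     # hypothetical row width for the illustration
ALL_ZEROS = 0
ALL_ONES = (1 << ROW_BITS) - 1


def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority of three rows, each modeled as an integer bit vector."""
    return (a & b) | (b & c) | (a & c)


def bulk_and(a: int, b: int) -> int:
    # Control row of all zeros: a bit is 1 only if both a and b are 1.
    return majority(a, b, ALL_ZEROS)


def bulk_or(a: int, b: int) -> int:
    # Control row of all ones: a bit is 1 if either a or b is 1.
    return majority(a, b, ALL_ONES)


if __name__ == "__main__":
    a, b = 0b1100, 0b1010
    print(format(bulk_and(a, b), "04b"))
    print(format(bulk_or(a, b), "04b"))
```

This software model only mirrors the logic; the performance benefit described in the abstract comes from performing the operation on whole DRAM rows without moving data over the memory bus.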

@inproceedings{abc,
abstract = {Many important applications trigger bulk bitwise operations, i.e., bitwise operations on large bit vectors. In fact, recent works design techniques that exploit fast bulk bitwise operations to accelerate databases (bitmap indices, BitWeaving) and web search (BitFunnel). Unfortunately, in existing architectures, the throughput of bulk bitwise operations is limited by the memory bandwidth available to the processing unit (e.g., CPU, GPU, FPGA, processing-in-memory).

To overcome this bottleneck, we propose Ambit, an Accelerator-in-Memory for bulk bitwise operations. Unlike prior works, Ambit exploits the analog operation of DRAM technology to perform bitwise operations completely inside DRAM, thereby exploiting the full internal DRAM bandwidth. Ambit consists of two components. First, simultaneous activation of three DRAM rows that share the same set of sense amplifiers enables the system to perform bitwise AND and OR operations. Second, with modest changes to the sense amplifier, the system can use the inverters present inside the sense amplifier to perform bitwise NOT operations. With these two components, Ambit can perform any bulk bitwise operation efficiently inside DRAM. Ambit largely exploits existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1\% of DRAM chip area). Importantly, Ambit uses the modern DRAM interface without any changes, and therefore it can be directly plugged onto the memory bus.

Our extensive circuit simulations show that Ambit works as expected even in the presence of significant process variation. Averaged across seven bulk bitwise operations, Ambit improves performance by 32X and reduces energy consumption by 35X compared to state-of-the-art systems. When integrated with Hybrid Memory Cube (HMC), a 3D-stacked DRAM with a logic layer, Ambit improves performance of bulk bitwise operations by 9.7X compared to processing in the logic layer of the HMC. Ambit improves the performance of three real-world data-intensive applications, 1) database bitmap indices, 2) BitWeaving, a technique to accelerate database scans, and 3) bit-vector-based implementation of sets, by 3X-7X compared to a state-of-the-art baseline using SIMD optimizations. We describe four other applications that can benefit from Ambit, including a recent technique proposed to speed up web search. We believe that large performance and energy improvements provided by Ambit can enable other applications to use bulk bitwise operations.},
author = {Vivek Seshadri and Donghyuk Lee and Thomas Mullins and Hasan Hassan and Amirali Boroumand and Jeremie Kim and Michael A. Kozuch and Onur Mutlu and Phillip B. Gibbons and Todd C. Mowry},
booktitle = {Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture},
title = {Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology},
venue = {Cambridge, MA, USA},
year = {2017}
}

Utility-Based Hybrid Memory Management

Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, and Onur Mutlu

Proceedings of the 2017 IEEE International Conference on Cluster Computing (CLUSTER), Honolulu, HI, USA, September 2017

While the memory footprints of cloud and HPC applications continue to increase, fundamental issues with DRAM scaling are likely to prevent traditional main memory systems, composed of monolithic DRAM, from greatly growing in capacity. Hybrid memory systems can mitigate the scaling limitations of monolithic DRAM by pairing together multiple memory technologies (e.g., different types of DRAM, or DRAM and non-volatile memory) at the same level of the memory hierarchy. The goal of a hybrid main memory is to combine the different advantages of the multiple memory types in a cost-effective manner while avoiding the disadvantages of each technology. Memory pages are placed in and migrated between the different memories within a hybrid memory system, based on the properties of each page. It is important to make intelligent page management (i.e., placement and migration) decisions, as they can significantly affect system performance. In this paper, we propose utility-based hybrid memory management (UH-MEM), a new page management mechanism for various hybrid memories, that systematically estimates the utility (i.e., the system performance benefit) of migrating a page between different memory types, and uses this information to guide data placement. UH-MEM operates in two steps. First, it estimates how much a single application would benefit from migrating one of its pages to a different type of memory, by comprehensively considering access frequency, row buffer locality, and memory-level parallelism. Second, it translates the estimated benefit of a single application to an estimate of the overall system performance benefit from such a migration. We evaluate the effectiveness of UH-MEM with various types of hybrid memories, and show that it significantly improves system performance on each of these hybrid memories. 
For a memory system with DRAM and non-volatile memory, UH-MEM improves performance by 14% on average (and up to 26%) compared to the best of three evaluated state-of-the-art mechanisms across a large number of data-intensive workloads.

@inproceedings{abc,
	abstract = {While the memory footprints of cloud and HPC applications continue to increase, fundamental issues with DRAM scaling are likely to prevent traditional main memory systems, composed of monolithic DRAM, from greatly growing in capacity. Hybrid memory systems can mitigate the scaling limitations of monolithic DRAM by pairing together multiple memory technologies (e.g., different types of DRAM, or DRAM and non-volatile memory) at the same level of the memory hierarchy. The goal of a hybrid main memory is to combine the different advantages of the multiple memory types in a cost-effective manner while avoiding the disadvantages of each technology. Memory pages are placed in and migrated between the different memories within a hybrid memory system, based on the properties of each page. It is important to make intelligent page management (i.e., placement and migration) decisions, as they can significantly affect system performance. In this paper, we propose utility-based hybrid memory management (UH-MEM), a new page management mechanism for various hybrid memories, that systematically estimates the utility (i.e., the system performance benefit) of migrating a page between different memory types, and uses this information to guide data placement. UH-MEM operates in two steps. First, it estimates how much a single application would benefit from migrating one of its pages to a different type of memory, by comprehensively considering access frequency, row buffer locality, and memory-level parallelism. Second, it translates the estimated benefit of a single application to an estimate of the overall system performance benefit from such a migration. We evaluate the effectiveness of UH-MEM with various types of hybrid memories, and show that it significantly improves system performance on each of these hybrid memories. 
For a memory system with DRAM and non-volatile memory, UH-MEM improves performance by 14\% on average (and up to 26\%) compared to the best of three evaluated state-of-the-art mechanisms across a large number of data-intensive workloads.},
	author = {Yang Li and Saugata Ghose and Jongmoo Choi and Jin Sun and Hui Wang and Onur Mutlu},
	booktitle = {Proceedings of the 2017 IEEE International Conference on Cluster Computing (CLUSTER)},
	title = {Utility-Based Hybrid Memory Management},
	venue = {Honolulu, HI, USA},
	year = {2017}
}

Concurrent Data Structures for Near-Memory Computing

Zhiyu Liu, Irina Calciu, Maurice Herlihy, and Onur Mutlu

Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2017, Washington DC, USA, July 2017

@inproceedings{abc,
	author = {Zhiyu Liu and Irina Calciu and Maurice Herlihy and Onur Mutlu},
	booktitle = {Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2017, Washington DC, USA},
	title = {Concurrent Data Structures for Near-Memory Computing},
	url = {http://doi.acm.org/10.1145/3087556.3087582},
	year = {2017}
}

Carpool: a bufferless on-chip network supporting adaptive multicast and hotspot alleviation

Xi-Yue Xiang, Wentao Shi, Saugata Ghose, Lu Peng, Onur Mutlu, and Nian-Feng Tzeng

Proceedings of the International Conference on Supercomputing, ICS 2017, Chicago, IL, USA, June 2017

@inproceedings{abc,
	author = {Xi-Yue Xiang and Wentao Shi and Saugata Ghose and Lu Peng and Onur Mutlu and Nian-Feng Tzeng},
	booktitle = {Proceedings of the International Conference on Supercomputing, ICS 2017, Chicago, IL, USA},
	title = {Carpool: a bufferless on-chip network supporting adaptive multicast and hotspot alleviation},
	url = {http://doi.acm.org/10.1145/3079079.3079090},
	year = {2017}
}

The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions

Minesh Patel, Jeremie Kim, and Onur Mutlu

Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, Toronto, ON, Canada, June 2017

@inproceedings{abc,
	author = {Minesh Patel and Jeremie Kim and Onur Mutlu},
	booktitle = {Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, Toronto, ON, Canada},
	title = {The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions},
	url = {http://doi.acm.org/10.1145/3079856.3080242},
	year = {2017}
}

Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms

Kevin K. Chang, Abdullah Giray Yaglikçi, Saugata Ghose, Aditya Agrawal, Niladrish Chatterjee, Abhijith Kashyap, Donghyuk Lee, Mike O'Connor, Hasan Hassan, and Onur Mutlu

Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, Urbana-Champaign, IL, USA, June 2017

@inproceedings{abc,
	author = {Kevin K. Chang and Abdullah Giray Yaglik{\c c}i and Saugata Ghose and Aditya Agrawal and Niladrish Chatterjee and Abhijith Kashyap and Donghyuk Lee and Mike O{\textquoteright}Connor and Hasan Hassan and Onur Mutlu},
	booktitle = {Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, Urbana-Champaign, IL, USA},
	title = {Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms},
	url = {http://doi.acm.org/10.1145/3078505.3078590},
	year = {2017}
}

Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms

Donghyuk Lee, Samira Manabi Khan, Lavanya Subramanian, Saugata Ghose, Rachata Ausavarungnirun, Gennady Pekhimenko, Vivek Seshadri, and Onur Mutlu

Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, Urbana-Champaign, IL, USA, June 2017

@inproceedings{abc,
	author = {Donghyuk Lee and Samira Manabi Khan and Lavanya Subramanian and Saugata Ghose and Rachata Ausavarungnirun and Gennady Pekhimenko and Vivek Seshadri and Onur Mutlu},
	booktitle = {Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, Urbana-Champaign, IL, USA},
	title = {Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms.},
	url = {http://doi.acm.org/10.1145/3078505.3078533},
	year = {2017}
}

FPGA-Accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-Off.

Kaan Kara, Dan Alistarh, Gustavo Alonso, Onur Mutlu, and Ce Zhang

25th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2017, Napa, CA, USA, April 2017

Stochastic gradient descent (SGD) is a commonly used algorithm for training linear machine learning models. Based on vector algebra, it benefits from the inherent parallelism available in an FPGA. In this paper, we first present a single-precision floating-point SGD implementation on an FPGA that provides similar performance as a 10-core CPU. We then adapt the design to make it capable of processing low-precision data. The low-precision data is obtained from a novel compression scheme, called stochastic quantization, specifically designed for machine learning applications. We test both full-precision and low-precision designs on various regression and classification data sets. We achieve up to an order of magnitude training speedup when using low-precision data compared to a full-precision SGD on the same FPGA and a state-of-the-art multi-core solution, while maintaining the quality of training. We open source the designs presented in this paper.

@inproceedings{abc,
	abstract = {Stochastic gradient descent (SGD) is a commonly used algorithm for training linear machine learning models. Based on vector algebra, it benefits from the inherent parallelism available in an FPGA. In this paper, we first present a single-precision floating-point SGD implementation on an FPGA that provides similar performance as a 10-core CPU. We then adapt the design to make it capable of processing low-precision data. The low-precision data is obtained from a novel compression scheme, called stochastic quantization, specifically designed for machine learning applications. We test both full-precision and low-precision designs on various regression and classification data sets. We achieve up to an order of magnitude training speedup when using low-precision data compared to a full-precision SGD on the same FPGA and a state-of-the-art multi-core solution, while maintaining the quality of training. We open source the designs presented in this paper.},
	author = {Kaan Kara and Dan Alistarh and Gustavo Alonso and Onur Mutlu and Ce Zhang},
	booktitle = {25th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2017, Napa, CA, USA},
	title = {FPGA-Accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-Off.},
	url = {https://doi.org/10.1109/FCCM.2017.39},
	venue = {Napa, CA, USA},
	year = {2017}
}

The RowHammer problem and other issues we may face as memory becomes denser.

Onur Mutlu

Design, Automation & Test in Europe Conference & Exhibition, DATE 2017, Lausanne, Switzerland, March 2017

@inproceedings{abc,
	author = {Onur Mutlu},
	booktitle = {Design, Automation \& Test in Europe Conference \& Exhibition, DATE 2017, Lausanne, Switzerland},
	title = {The RowHammer problem and other issues we may face as memory becomes denser.},
	url = {https://doi.org/10.23919/DATE.2017.7927156},
	year = {2017}
}

Improving the Reliability of Chip-Off Forensic Analysis of NAND Flash Memory Devices

Aya Fukami, Saugata Ghose, Yixin Luo, Yu Cai, and Onur Mutlu

Proceedings of the 2017 Digital Forensics Conference, Überlingen, Germany, March 2017

Digital forensic investigators often need to extract data from a seized device that contains NAND flash memory. Many such devices are physically damaged, preventing investigators from using automated techniques to extract the data stored within the device. Instead, investigators turn to chip-off analysis, where they use a thermal-based procedure to physically remove the NAND flash memory chip from the device, and access the chip directly to extract the raw data stored on the chip. We perform an analysis of the errors introduced into multi-level cell (MLC) NAND flash memory chips after the device has been seized. We make two major observations. First, between the time that a device is seized and the time digital forensic investigators perform data extraction, a large number of errors can be introduced as a result of charge leakage from the cells of the NAND flash memory (known as data retention errors). Second, when thermal-based chip removal is performed, the number of errors in the data stored within NAND flash memory can increase by two or more orders of magnitude, as the high temperature applied to the chip greatly accelerates charge leakage. We demonstrate that the chip-off analysis based forensic data recovery procedure is quite destructive, and can often render most of the data within NAND flash memory uncorrectable, and, thus, unrecoverable. To mitigate the errors introduced during the forensic recovery process, we explore a new hardware-based approach. We exploit a fine-grained read reference voltage control mechanism implemented in modern NAND flash memory chips, called read-retry, which can compensate for the charge leakage that occurs due to (1) retention loss and (2) thermal-based chip removal. The read-retry mechanism successfully reduces the number of errors, such that the original data can be fully recovered in our tested chips as long as the chips were not heavily used prior to seizure. We conclude that the read-retry mechanism should be adopted as part of the forensic data recovery process.

@inproceedings{abc,
abstract = {Digital forensic investigators often need to extract data from a seized device that contains NAND flash memory. Many such devices are physically damaged, preventing investigators from using automated techniques to extract the data stored within the device. Instead, investigators turn to chip-off analysis, where they use a thermal-based procedure to physically remove the NAND flash memory chip from the device, and access the chip directly to extract the raw data stored on the chip.

We perform an analysis of the errors introduced into multi-level cell (MLC) NAND flash memory chips after the device has been seized. We make two major observations. First, between the time that a device is seized and the time digital forensic investigators perform data extraction, a large number of errors can be introduced as a result of charge leakage from the cells of the NAND flash memory (known as data retention errors). Second, when thermal-based chip removal is performed, the number of errors in the data stored within NAND flash memory can increase by two or more orders of magnitude, as the high temperature applied to the chip greatly accelerates charge leakage. We demonstrate that the chip-off analysis based forensic data recovery procedure is quite destructive, and can often render most of the data within NAND flash memory uncorrectable, and, thus, unrecoverable.

To mitigate the errors introduced during the forensic recovery process, we explore a new hardware-based approach. We exploit a fine-grained read reference voltage control mechanism implemented in modern NAND flash memory chips, called read-retry, which can compensate for the charge leakage that occurs due to (1) retention loss and (2) thermal-based chip removal. The read-retry mechanism successfully reduces the number of errors, such that the original data can be fully recovered in our tested chips as long as the chips were not heavily used prior to seizure. We conclude that the read-retry mechanism should be adopted as part of the forensic data recovery process.},
author = {Aya Fukami and Saugata Ghose and Yixin Luo and Yu Cai and Onur Mutlu},
booktitle = {Proceedings of the 2017 Digital Forensics Conference},
title = {Improving the Reliability of Chip-Off Forensic Analysis of NAND Flash Memory Devices},
venue = {{\"U}berlingen, Germany},
year = {2017}
}

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds.

Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, and Onur Mutlu

14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017, Boston, MA, USA, March 2017

@inproceedings{abc,
	author = {Kevin Hsieh and Aaron Harlap and Nandita Vijaykumar and Dimitris Konomis and Gregory R. Ganger and Phillip B. Gibbons and Onur Mutlu},
	booktitle = {14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017, Boston, MA, USA},
	title = {Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds.},
	url = {https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/hsieh},
	year = {2017}
}

Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques.

Yu Cai, Saugata Ghose, Yixin Luo, Ken Mai, Onur Mutlu, and Erich F. Haratsch

2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA, February 2017

@inproceedings{abc,
	author = {Yu Cai and Saugata Ghose and Yixin Luo and Ken Mai and Onur Mutlu and Erich F. Haratsch},
	booktitle = {2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA},
	title = {Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques.},
	url = {https://doi.org/10.1109/HPCA.2017.61},
	year = {2017}
}

SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies.

Hasan Hassan, Nandita Vijaykumar, Samira Manabi Khan, Saugata Ghose, Kevin K. Chang, Gennady Pekhimenko, Donghyuk Lee, Oguz Ergin, and Onur Mutlu

2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA, February 2017

@inproceedings{abc,
	author = {Hasan Hassan and Nandita Vijaykumar and Samira Manabi Khan and Saugata Ghose and Kevin K. Chang and Gennady Pekhimenko and Donghyuk Lee and Oguz Ergin and Onur Mutlu},
	booktitle = {2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA},
	title = {SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies.},
	url = {https://doi.org/10.1109/HPCA.2017.62},
	year = {2017}
}

Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid-State Drives.

Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo, and Onur Mutlu

CoRR, January 2017

@article{abc,
	author = {Yu Cai and Saugata Ghose and Erich F. Haratsch and Yixin Luo and Onur Mutlu},
	journal = {CoRR},
	title = {Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid-State Drives.},
	url = {http://arxiv.org/abs/1706.08642},
	year = {2017}
}

Using ECC DRAM to Adaptively Increase Memory Capacity.

Yixin Luo, Saugata Ghose, Tianshi Li, Sriram Govindan, Bikash Sharma, Bryan Kelly, Amirali Boroumand, and Onur Mutlu

CoRR, January 2017

@article{abc,
	author = {Yixin Luo and Saugata Ghose and Tianshi Li and Sriram Govindan and Bikash Sharma and Bryan Kelly and Amirali Boroumand and Onur Mutlu},
	journal = {CoRR},
	title = {Using ECC DRAM to Adaptively Increase Memory Capacity.},
	url = {http://arxiv.org/abs/1706.08870},
	year = {2017}
}

Improving the Performance and Endurance of Persistent Memory with Loose-Ordering Consistency.

Youyou Lu, Jiwu Shu, Long Sun, and Onur Mutlu

CoRR, January 2017

@article{abc,
	author = {Youyou Lu and Jiwu Shu and Long Sun and Onur Mutlu},
	journal = {CoRR},
	title = {Improving the Performance and Endurance of Persistent Memory with Loose-Ordering Consistency.},
	url = {http://arxiv.org/abs/1705.03623},
	year = {2017}
}

LazyPIM: Efficient Support for Cache Coherence in Processing-in-Memory Architectures.

Amirali Boroumand, Saugata Ghose, Minesh Patel, Hasan Hassan, Brandon Lucia, Nastaran Hajinazar, Kevin Hsieh, Krishna T. Malladi, Hongzhong Zheng, and Onur Mutlu

CoRR, January 2017

@article{abc,
	author = {Amirali Boroumand and Saugata Ghose and Minesh Patel and Hasan Hassan and Brandon Lucia and Nastaran Hajinazar and Kevin Hsieh and Krishna T. Malladi and Hongzhong Zheng and Onur Mutlu},
	journal = {CoRR},
	title = {LazyPIM: Efficient Support for Cache Coherence in Processing-in-Memory Architectures.},
	url = {http://arxiv.org/abs/1706.03162},
	year = {2017}
}

Understanding Reduced-Voltage Operation in Modern DRAM Chips: Characterization, Analysis, and Mechanisms.

Kevin K. Chang, Abdullah Giray Yaglikçi, Saugata Ghose, Aditya Agrawal, Niladrish Chatterjee, Abhijith Kashyap, Donghyuk Lee, Mike O'Connor, Hasan Hassan, and Onur Mutlu

CoRR, January 2017

@article{abc,
	author = {Kevin K. Chang and Abdullah Giray Yaglik{\c c}i and Saugata Ghose and Aditya Agrawal and Niladrish Chatterjee and Abhijith Kashyap and Donghyuk Lee and Mike O{\textquoteright}Connor and Hasan Hassan and Onur Mutlu},
	journal = {CoRR},
	title = {Understanding Reduced-Voltage Operation in Modern DRAM Chips: Characterization, Analysis, and Mechanisms.},
	url = {http://arxiv.org/abs/1705.10292},
	year = {2017}
}

Chapter Four - Simple Operations in Memory to Reduce Data Movement.

Vivek Seshadri and Onur Mutlu

Advances in Computers, January 2017

@article{abc,
	author = {Vivek Seshadri and Onur Mutlu},
	journal = {Advances in Computers},
	title = {Chapter Four - Simple Operations in Memory to Reduce Data Movement.},
	url = {https://doi.org/10.1016/bs.adcom.2017.04.004},
	year = {2017}
}

Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms.

Kevin K. Chang, A. Giray Yaglikçi, Saugata Ghose, Aditya Agrawal, Niladrish Chatterjee, Abhijith Kashyap, Donghyuk Lee, Mike O'Connor, Hasan Hassan, and Onur Mutlu

POMACS, January 2017

@article{abc,
	author = {Kevin K. Chang and A. Giray Yaglik{\c c}i and Saugata Ghose and Aditya Agrawal and Niladrish Chatterjee and Abhijith Kashyap and Donghyuk Lee and Mike O{\textquoteright}Connor and Hasan Hassan and Onur Mutlu},
	journal = {POMACS},
	title = {Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms.},
	url = {http://doi.acm.org/10.1145/3084447},
	year = {2017}
}

Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms.

Donghyuk Lee, Samira Manabi Khan, Lavanya Subramanian, Saugata Ghose, Rachata Ausavarungnirun, Gennady Pekhimenko, Vivek Seshadri, and Onur Mutlu

POMACS, January 2017

@article{abc,
	author = {Donghyuk Lee and Samira Manabi Khan and Lavanya Subramanian and Saugata Ghose and Rachata Ausavarungnirun and Gennady Pekhimenko and Vivek Seshadri and Onur Mutlu},
	journal = {POMACS},
	title = {Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms.},
	url = {http://doi.acm.org/10.1145/3084464},
	year = {2017}
}

The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser.

Onur Mutlu

CoRR, January 2017

@article{abc,
	author = {Onur Mutlu},
	journal = {CoRR},
	title = {The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser.},
	url = {http://arxiv.org/abs/1703.00626},
	year = {2017}
}

Banshee: Bandwidth-Efficient DRAM Caching Via Software/Hardware Cooperation.

Xiangyao Yu, Christopher J. Hughes, Nadathur Satish, Onur Mutlu, and Srinivas Devadas

CoRR, January 2017

@article{abc,
	author = {Xiangyao Yu and Christopher J. Hughes and Nadathur Satish and Onur Mutlu and Srinivas Devadas},
	journal = {CoRR},
	title = {Banshee: Bandwidth-Efficient DRAM Caching Via Software/Hardware Cooperation.},
	url = {http://arxiv.org/abs/1704.02677},
	year = {2017}
}

LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory.

Amirali Boroumand, Saugata Ghose, Minesh Patel, Hasan Hassan, Brandon Lucia, Kevin Hsieh, Krishna T. Malladi, Hongzhong Zheng, and Onur Mutlu

Computer Architecture Letters, January 2017

@article{abc,
	author = {Amirali Boroumand and Saugata Ghose and Minesh Patel and Hasan Hassan and Brandon Lucia and Kevin Hsieh and Krishna T. Malladi and Hongzhong Zheng and Onur Mutlu},
	journal = {Computer Architecture Letters},
	title = {LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory.},
	url = {https://doi.org/10.1109/LCA.2016.2577557},
	year = {2017}
}

2016

NVMOVE: Helping Programmers Move to Byte-Based Persistence.

Himanshu Chauhan, Irina Calciu, Vijay Chidambaram, Eric Schkufza, Onur Mutlu, and Pratap Subrahmanyam

4th Workshop on Interactions of NVM/Flash with Operating Systems and Workloads, INFLOW@OSDI 2016, Savannah, GA, USA, November 2016

@inproceedings{abc,
	author = {Himanshu Chauhan and Irina Calciu and Vijay Chidambaram and Eric Schkufza and Onur Mutlu and Pratap Subrahmanyam},
	booktitle = {4th Workshop on Interactions of NVM/Flash with Operating Systems and Workloads, INFLOW@OSDI 2016, Savannah, GA, USA},
	title = {NVMOVE: Helping Programmers Move to Byte-Based Persistence.},
	url = {https://www.usenix.org/conference/inflow16/workshop-program/presentation/chauhan},
	year = {2016}
}

Yak: A High-Performance Big-Data-Friendly Garbage Collector.

Khanh Nguyen, Lu Fang, Guoqing (Harry) Xu, Brian Demsky, Shan Lu, Sanazsadat Alamian, and Onur Mutlu

12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2016

@inproceedings{abc,
	author = {Khanh Nguyen and Lu Fang and Guoqing (Harry) Xu and Brian Demsky and Shan Lu and Sanazsadat Alamian and Onur Mutlu},
	booktitle = {12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA},
	title = {Yak: A High-Performance Big-Data-Friendly Garbage Collector.},
	url = {https://www.usenix.org/conference/osdi16/technical-sessions/presentation/nguyen},
	year = {2016}
}

Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation.

Kevin Hsieh, Samira Manabi Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, and Onur Mutlu

34th IEEE International Conference on Computer Design, ICCD 2016, Scottsdale, AZ, USA, October 2016

@inproceedings{abc,
	author = {Kevin Hsieh and Samira Manabi Khan and Nandita Vijaykumar and Kevin K. Chang and Amirali Boroumand and Saugata Ghose and Onur Mutlu},
	booktitle = {34th IEEE International Conference on Computer Design, ICCD 2016, Scottsdale, AZ, USA},
	title = {Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation.},
	url = {http://dx.doi.org/10.1109/ICCD.2016.7753257},
	year = {2016}
}

A model for Application Slowdown Estimation in on-chip networks and its use for improving system fairness and performance.

Xi-Yue Xiang, Saugata Ghose, Onur Mutlu, and Nian-Feng Tzeng

34th IEEE International Conference on Computer Design, ICCD 2016, Scottsdale, AZ, USA, October 2016

@inproceedings{abc,
	author = {Xi-Yue Xiang and Saugata Ghose and Onur Mutlu and Nian-Feng Tzeng},
	booktitle = {34th IEEE International Conference on Computer Design, ICCD 2016, Scottsdale, AZ, USA},
	title = {A model for Application Slowdown Estimation in on-chip networks and its use for improving system fairness and performance.},
	url = {http://dx.doi.org/10.1109/ICCD.2016.7753327},
	year = {2016}
}

Zorua: A holistic approach to resource virtualization in GPUs.

Nandita Vijaykumar, Kevin Hsieh, Gennady Pekhimenko, Samira Manabi Khan, Ashish Shrestha, Saugata Ghose, Adwait Jog, Phillip B. Gibbons, and Onur Mutlu

49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan, October 2016

@inproceedings{abc,
	author = {Nandita Vijaykumar and Kevin Hsieh and Gennady Pekhimenko and Samira Manabi Khan and Ashish Shrestha and Saugata Ghose and Adwait Jog and Phillip B. Gibbons and Onur Mutlu},
	booktitle = {49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan},
	title = {Zorua: A holistic approach to resource virtualization in GPUs.},
	url = {http://dx.doi.org/10.1109/MICRO.2016.7783718},
	year = {2016}
}

Continuous runahead: Transparent hardware acceleration for memory intensive workloads.

Milad Hashemi, Onur Mutlu, and Yale N. Patt

49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan, October 2016

@inproceedings{abc,
	author = {Milad Hashemi and Onur Mutlu and Yale N. Patt},
	booktitle = {49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan},
	title = {Continuous runahead: Transparent hardware acceleration for memory intensive workloads.},
	url = {http://dx.doi.org/10.1109/MICRO.2016.7783764},
	year = {2016}
}

Keynote: rethinking memory system design.

Onur Mutlu

2016 International Symposium on Rapid System Prototyping, RSP 2016, Pittsburgh, PA, USA, October 2016

@inproceedings{abc,
	author = {Onur Mutlu},
	booktitle = {2016 International Symposium on Rapid System Prototyping, RSP 2016, Pittsburgh, PA, USA},
	title = {Keynote: rethinking memory system design.},
	url = {https://doi.org/10.1145/2990299.2990300},
	year = {2016}
}

Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities.

Ashutosh Pattnaik, Xulong Tang, Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, and Chita R. Das

Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT 2016, Haifa, Israel, September 2016

@inproceedings{abc,
	author = {Ashutosh Pattnaik and Xulong Tang and Adwait Jog and Onur Kayiran and Asit K. Mishra and Mahmut T. Kandemir and Onur Mutlu and Chita R. Das},
	booktitle = {Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT 2016, Haifa, Israel},
	title = {Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities.},
	url = {http://doi.acm.org/10.1145/2967938.2967940},
	year = {2016}
}

μC-States: Fine-grained GPU Datapath Power Management.

Onur Kayiran, Adwait Jog, Ashutosh Pattnaik, Rachata Ausavarungnirun, Xulong Tang, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, and Chita R. Das

Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT 2016, Haifa, Israel, September 2016

@inproceedings{abc,
	author = {Onur Kayiran and Adwait Jog and Ashutosh Pattnaik and Rachata Ausavarungnirun and Xulong Tang and Mahmut T. Kandemir and Gabriel H. Loh and Onur Mutlu and Chita R. Das},
	booktitle = {Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT 2016, Haifa, Israel},
	title = {{$\mu$}C-States: Fine-grained GPU Datapath Power Management.},
	url = {http://doi.acm.org/10.1145/2967938.2967941},
	year = {2016}
}

Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems.

Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, Onur Mutlu, and Stephen W. Keckler

43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 2016

@inproceedings{abc,
	author = {Kevin Hsieh and Eiman Ebrahimi and Gwangsun Kim and Niladrish Chatterjee and Mike O{\textquoteright}Connor and Nandita Vijaykumar and Onur Mutlu and Stephen W. Keckler},
	booktitle = {43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea},
	title = {Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems.},
	url = {http://doi.ieeecomputersociety.org/10.1109/ISCA.2016.27},
	year = {2016}
}

Accelerating Dependent Cache Misses with an Enhanced Memory Controller.

Milad Hashemi, Khubaib, Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt

43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 2016

@inproceedings{abc,
	author = {Milad Hashemi and Khubaib and Eiman Ebrahimi and Onur Mutlu and Yale N. Patt},
	booktitle = {43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea},
	title = {Accelerating Dependent Cache Misses with an Enhanced Memory Controller.},
	url = {http://dx.doi.org/10.1109/ISCA.2016.46},
	year = {2016}
}

Invited - Who is the major threat to tomorrow's security?: you, the hardware designer.

Wayne P. Burleson, Onur Mutlu, and Mohit Tiwari

Proceedings of the 53rd Annual Design Automation Conference, DAC 2016, Austin, TX, USA, June 2016

@inproceedings{abc,
	author = {Wayne P. Burleson and Onur Mutlu and Mohit Tiwari},
	booktitle = {Proceedings of the 53rd Annual Design Automation Conference, DAC 2016, Austin, TX, USA},
	title = {Invited - Who is the major threat to tomorrow{\textquoteright}s security?: you, the hardware designer.},
	url = {http://doi.acm.org/10.1145/2897937.2905022},
	year = {2016}
}

PARBOR: An Efficient System-Level Technique to Detect Data-Dependent Failures in DRAM.

Samira Manabi Khan, Donghyuk Lee, and Onur Mutlu

46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2016, Toulouse, France, June 2016

@inproceedings{abc,
	author = {Samira Manabi Khan and Donghyuk Lee and Onur Mutlu},
	booktitle = {46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2016, Toulouse, France},
	title = {PARBOR: An Efficient System-Level Technique to Detect Data-Dependent Failures in DRAM.},
	url = {http://dx.doi.org/10.1109/DSN.2016.30},
	year = {2016}
}

Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization.

Kevin K. Chang, Abhijith Kashyap, Hasan Hassan, Saugata Ghose, Kevin Hsieh, Donghyuk Lee, Tianshi Li, Gennady Pekhimenko, Samira Manabi Khan, and Onur Mutlu

Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Antibes Juan-Les-Pins, France, June 2016

@inproceedings{abc,
	author = {Kevin K. Chang and Abhijith Kashyap and Hasan Hassan and Saugata Ghose and Kevin Hsieh and Donghyuk Lee and Tianshi Li and Gennady Pekhimenko and Samira Manabi Khan and Onur Mutlu},
	booktitle = {Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Antibes Juan-Les-Pins, France},
	title = {Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization.},
	url = {http://doi.acm.org/10.1145/2896377.2901453},
	year = {2016}
}

Exploiting Core Criticality for Enhanced GPU Performance.

Adwait Jog, Onur Kayiran, Ashutosh Pattnaik, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das

Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Antibes Juan-Les-Pins, France, June 2016

@inproceedings{abc,
	author = {Adwait Jog and Onur Kayiran and Ashutosh Pattnaik and Mahmut T. Kandemir and Onur Mutlu and Ravishankar Iyer and Chita R. Das},
	booktitle = {Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Antibes Juan-Les-Pins, France},
	title = {Exploiting Core Criticality for Enhanced GPU Performance.},
	url = {http://doi.acm.org/10.1145/2896377.2901468},
	year = {2016}
}

ChargeCache: Reducing DRAM latency by exploiting row access locality.

Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, Vivek Seshadri, Donghyuk Lee, Oguz Ergin, and Onur Mutlu

2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 2016

@inproceedings{abc,
	author = {Hasan Hassan and Gennady Pekhimenko and Nandita Vijaykumar and Vivek Seshadri and Donghyuk Lee and Oguz Ergin and Onur Mutlu},
	booktitle = {2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain},
	title = {ChargeCache: Reducing DRAM latency by exploiting row access locality.},
	url = {http://dx.doi.org/10.1109/HPCA.2016.7446096},
	year = {2016}
}

SizeCap: Efficiently handling power surges in fuel cell powered data centers.

Yang Li, Di Wang, Saugata Ghose, Jie Liu, Sriram Govindan, Sean James, Eric Peterson, John Siegler, Rachata Ausavarungnirun, and Onur Mutlu

2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 2016

@inproceedings{abc,
	author = {Yang Li and Di Wang and Saugata Ghose and Jie Liu and Sriram Govindan and Sean James and Eric Peterson and John Siegler and Rachata Ausavarungnirun and Onur Mutlu},
	booktitle = {2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain},
	title = {SizeCap: Efficiently handling power surges in fuel cell powered data centers.},
	url = {http://dx.doi.org/10.1109/HPCA.2016.7446085},
	year = {2016}
}

Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM.

Kevin K. Chang, Prashant J. Nair, Donghyuk Lee, Saugata Ghose, Moinuddin K. Qureshi, and Onur Mutlu

2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 2016

@inproceedings{abc,
	author = {Kevin K. Chang and Prashant J. Nair and Donghyuk Lee and Saugata Ghose and Moinuddin K. Qureshi and Onur Mutlu},
	booktitle = {2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain},
	title = {Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM.},
	url = {http://dx.doi.org/10.1109/HPCA.2016.7446095},
	year = {2016}
}

A case for toggle-aware compression for GPU systems.

Gennady Pekhimenko, Evgeny Bolotin, Nandita Vijaykumar, Onur Mutlu, Todd C. Mowry, and Stephen W. Keckler

2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 2016

@inproceedings{abc,
	author = {Gennady Pekhimenko and Evgeny Bolotin and Nandita Vijaykumar and Onur Mutlu and Todd C. Mowry and Stephen W. Keckler},
	booktitle = {2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain},
	title = {A case for toggle-aware compression for GPU systems.},
	url = {http://dx.doi.org/10.1109/HPCA.2016.7446064},
	year = {2016}
}

Tiered-Latency DRAM (TL-DRAM).

Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, and Onur Mutlu

CoRR, January 2016

@article{abc,
	author = {Donghyuk Lee and Yoongu Kim and Vivek Seshadri and Jamie Liu and Lavanya Subramanian and Onur Mutlu},
	journal = {CoRR},
	title = {Tiered-Latency DRAM (TL-DRAM).},
	url = {http://arxiv.org/abs/1601.06903},
	year = {2016}
}

Heterogeneous-Reliability Memory: Exploiting Application-Level Memory Error Tolerance.

Yixin Luo, Sriram Govindan, Bikash Sharma, Mark Santaniello, Justin Meza, Aman Kansal, Jie Liu, Badriddine M. Khessib, Kushagra Vaid, and Onur Mutlu

CoRR, January 2016

@article{abc,
	author = {Yixin Luo and Sriram Govindan and Bikash Sharma and Mark Santaniello and Justin Meza and Aman Kansal and Jie Liu and Badriddine M. Khessib and Kushagra Vaid and Onur Mutlu},
	journal = {CoRR},
	title = {Heterogeneous-Reliability Memory: Exploiting Application-Level Memory Error Tolerance.},
	url = {http://arxiv.org/abs/1602.00729},
	year = {2016}
}

Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM.

Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry

CoRR, January 2016

@article{abc,
	author = {Vivek Seshadri and Donghyuk Lee and Thomas Mullins and Hasan Hassan and Amirali Boroumand and Jeremie Kim and Michael A. Kozuch and Onur Mutlu and Phillip B. Gibbons and Todd C. Mowry},
	journal = {CoRR},
	title = {Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM.},
	url = {http://arxiv.org/abs/1611.09988},
	year = {2016}
}

Achieving both High Energy Efficiency and High Performance in On-Chip Communication using Hierarchical Rings with Deflection Routing.

Rachata Ausavarungnirun, Chris Fallin, Xiangyao Yu, Kevin Kai-Wei Chang, Greg Nazario, Reetuparna Das, Gabriel H. Loh, and Onur Mutlu

CoRR, January 2016

@article{abc,
	author = {Rachata Ausavarungnirun and Chris Fallin and Xiangyao Yu and Kevin Kai-Wei Chang and Greg Nazario and Reetuparna Das and Gabriel H. Loh and Onur Mutlu},
	journal = {CoRR},
	title = {Achieving both High Energy Efficiency and High Performance in On-Chip Communication using Hierarchical Rings with Deflection Routing.},
	url = {http://arxiv.org/abs/1602.06005},
	year = {2016}
}

GateKeeper: Enabling Fast Pre-Alignment in DNA Short Read Mapping with a New Streaming Accelerator Architecture.

Mohammed Alser, Hasan Hassan, Hongyi Xin, Oguz Ergin, Onur Mutlu, and Can Alkan

CoRR, January 2016

@article{abc,
	author = {Mohammed Alser and Hasan Hassan and Hongyi Xin and Oguz Ergin and Onur Mutlu and Can Alkan},
	journal = {CoRR},
	title = {GateKeeper: Enabling Fast Pre-Alignment in DNA Short Read Mapping with a New Streaming Accelerator Architecture.},
	url = {http://arxiv.org/abs/1604.01789},
	year = {2016}
}

A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps.

Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Saugata Ghose, Abhishek Bhowmick, Rachata Ausavarungnirun, Chita R. Das, Mahmut T. Kandemir, Todd C. Mowry, and Onur Mutlu

CoRR, January 2016

Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. This work describes the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate "assist warps" that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.

@article{abc,
abstract = {Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive.
This work describes the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate "assist warps" that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency.
CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps.
We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7\% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.},
author = {Nandita Vijaykumar and Gennady Pekhimenko and Adwait Jog and Saugata Ghose and Abhishek Bhowmick and Rachata Ausavarungnirun and Chita R. Das and Mahmut T. Kandemir and Todd C. Mowry and Onur Mutlu},
journal = {CoRR},
title = {A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps.},
url = {http://arxiv.org/abs/1602.01348},
year = {2016}
}

A case for hierarchical rings with deflection routing: An energy-efficient on-chip communication substrate.

Rachata Ausavarungnirun, Chris Fallin, Xiangyao Yu, Kevin Kai-Wei Chang, Greg Nazario, Reetuparna Das, Gabriel H. Loh, and Onur Mutlu

Parallel Computing, January 2016

@article{abc,
	author = {Rachata Ausavarungnirun and Chris Fallin and Xiangyao Yu and Kevin Kai-Wei Chang and Greg Nazario and Reetuparna Das and Gabriel H. Loh and Onur Mutlu},
	journal = {Parallel Computing},
	title = {A case for hierarchical rings with deflection routing: An energy-efficient on-chip communication substrate.},
	url = {http://dx.doi.org/10.1016/j.parco.2016.01.009},
	year = {2016}
}

BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling.

Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, and Onur Mutlu

IEEE Trans. Parallel Distrib. Syst., January 2016

@article{abc,
	author = {Lavanya Subramanian and Donghyuk Lee and Vivek Seshadri and Harsha Rastogi and Onur Mutlu},
	journal = {IEEE Trans. Parallel Distrib. Syst.},
	title = {BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling.},
	url = {http://dx.doi.org/10.1109/TPDS.2016.2526003},
	year = {2016}
}

Enabling Efficient Dynamic Resizing of Large DRAM Caches via A Hardware Consistent Hashing Mechanism.

Kevin Kai-Wei Chang, Gabriel H. Loh, Mithuna Thottethodi, Yasuko Eckert, Mike O'Connor, Srilatha Manne, Lisa Hsu, Lavanya Subramanian, and Onur Mutlu

CoRR, January 2016

@article{abc,
	author = {Kevin Kai-Wei Chang and Gabriel H. Loh and Mithuna Thottethodi and Yasuko Eckert and Mike O{\textquoteright}Connor and Srilatha Manne and Lisa Hsu and Lavanya Subramanian and Onur Mutlu},
	journal = {CoRR},
	title = {Enabling Efficient Dynamic Resizing of Large DRAM Caches via A Hardware Consistent Hashing Mechanism.},
	url = {http://arxiv.org/abs/1602.00722},
	year = {2016}
}

The 2014 MICRO Test of Time Award Winners: From 1978 to 1992.

Onur Mutlu, Richard A. Belgard, Nick Tredennick, and Mike Schlansker

IEEE Micro, January 2016

@article{abc,
	author = {Onur Mutlu and Richard A. Belgard and Nick Tredennick and Mike Schlansker},
	journal = {IEEE Micro},
	title = {The 2014 MICRO Test of Time Award Winners: From 1978 to 1992.},
	url = {http://dx.doi.org/10.1109/MM.2016.7},
	year = {2016}
}

Optimal seed solver: optimizing seed selection in read mapping.

Hongyi Xin, Sunny Nahar, Richard Zhu, John Emmons, Gennady Pekhimenko, Carl Kingsford, Can Alkan, and Onur Mutlu

Bioinformatics, January 2016

@article{abc,
	author = {Hongyi Xin and Sunny Nahar and Richard Zhu and John Emmons and Gennady Pekhimenko and Carl Kingsford and Can Alkan and Onur Mutlu},
	journal = {Bioinformatics},
	title = {Optimal seed solver: optimizing seed selection in read mapping.},
	url = {http://dx.doi.org/10.1093/bioinformatics/btv670},
	year = {2016}
}

Enabling Accurate and Practical Online Flash Channel Modeling for Modern MLC NAND Flash Memory.

Yixin Luo, Saugata Ghose, Yu Cai, Erich F. Haratsch, and Onur Mutlu

IEEE Journal on Selected Areas in Communications, January 2016

@article{abc,
	author = {Yixin Luo and Saugata Ghose and Yu Cai and Erich F. Haratsch and Onur Mutlu},
	journal = {IEEE Journal on Selected Areas in Communications},
	title = {Enabling Accurate and Practical Online Flash Channel Modeling for Modern MLC NAND Flash Memory.},
	url = {http://dx.doi.org/10.1109/JSAC.2016.2603608},
	year = {2016}
}

RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads.

Amir Yazdanbakhsh, Gennady Pekhimenko, Bradley Thwaites, Hadi Esmaeilzadeh, Onur Mutlu, and Todd C. Mowry

TACO, January 2016

@article{abc,
	author = {Amir Yazdanbakhsh and Gennady Pekhimenko and Bradley Thwaites and Hadi Esmaeilzadeh and Onur Mutlu and Todd C. Mowry},
	journal = {TACO},
	title = {RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads.},
	url = {http://doi.acm.org/10.1145/2836168},
	year = {2016}
}

Bounding and reducing memory interference in COTS-based multi-core systems.

Hyoseung Kim, Dionisio de Niz, Björn Andersson, Mark H. Klein, Onur Mutlu, and Ragunathan Rajkumar

Real-Time Systems, January 2016

@article{abc,
	author = {Hyoseung Kim and Dionisio de Niz and Bj{\"o}rn Andersson and Mark H. Klein and Onur Mutlu and Ragunathan Rajkumar},
	journal = {Real-Time Systems},
	title = {Bounding and reducing memory interference in COTS-based multi-core systems.},
	url = {http://dx.doi.org/10.1007/s11241-016-9248-1},
	year = {2016}
}

The Processing Using Memory Paradigm: In-DRAM Bulk Copy, Initialization, Bitwise AND and OR.

Vivek Seshadri and Onur Mutlu

CoRR, January 2016

@article{abc,
	author = {Vivek Seshadri and Onur Mutlu},
	journal = {CoRR},
	title = {The Processing Using Memory Paradigm: In-DRAM Bulk Copy, Initialization, Bitwise AND and OR.},
	url = {http://arxiv.org/abs/1610.09603},
	year = {2016}
}

Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost.

Donghyuk Lee, Saugata Ghose, Gennady Pekhimenko, Samira Manabi Khan, and Onur Mutlu

TACO, January 2016

@article{abc,
	author = {Donghyuk Lee and Saugata Ghose and Gennady Pekhimenko and Samira Manabi Khan and Onur Mutlu},
	journal = {TACO},
	title = {Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost.},
	url = {http://doi.acm.org/10.1145/2832911},
	year = {2016}
}

Adaptive-Latency DRAM (AL-DRAM).

Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Manabi Khan, Vivek Seshadri, Kevin Kai-Wei Chang, and Onur Mutlu

CoRR, January 2016

@article{abc,
	author = {Donghyuk Lee and Yoongu Kim and Gennady Pekhimenko and Samira Manabi Khan and Vivek Seshadri and Kevin Kai-Wei Chang and Onur Mutlu},
	journal = {CoRR},
	title = {Adaptive-Latency DRAM (AL-DRAM).},
	url = {http://arxiv.org/abs/1603.08454},
	year = {2016}
}

Reducing DRAM Latency by Exploiting Design-Induced Latency Variation in Modern DRAM Chips.

Donghyuk Lee, Samira Manabi Khan, Lavanya Subramanian, Rachata Ausavarungnirun, Gennady Pekhimenko, Vivek Seshadri, Saugata Ghose, and Onur Mutlu

CoRR, January 2016

@article{abc,
	author = {Donghyuk Lee and Samira Manabi Khan and Lavanya Subramanian and Rachata Ausavarungnirun and Gennady Pekhimenko and Vivek Seshadri and Saugata Ghose and Onur Mutlu},
	journal = {CoRR},
	title = {Reducing DRAM Latency by Exploiting Design-Induced Latency Variation in Modern DRAM Chips.},
	url = {http://arxiv.org/abs/1610.09604},
	year = {2016}
}

DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators.

Hiroyuki Usui, Lavanya Subramanian, Kevin Kai-Wei Chang, and Onur Mutlu

TACO, January 2016

@article{abc,
	author = {Hiroyuki Usui and Lavanya Subramanian and Kevin Kai-Wei Chang and Onur Mutlu},
	journal = {TACO},
	title = {DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators.},
	url = {http://doi.acm.org/10.1145/2847255},
	year = {2016}
}

RowHammer: Reliability Analysis and Security Implications.

Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji-Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu

CoRR, January 2016

@article{abc,
	author = {Yoongu Kim and Ross Daly and Jeremie Kim and Chris Fallin and Ji-Hye Lee and Donghyuk Lee and Chris Wilkerson and Konrad Lai and Onur Mutlu},
	journal = {CoRR},
	title = {RowHammer: Reliability Analysis and Security Implications.},
	url = {http://arxiv.org/abs/1603.00747},
	year = {2016}
}

Ramulator: A Fast and Extensible DRAM Simulator.

Yoongu Kim, Weikun Yang, and Onur Mutlu

Computer Architecture Letters, January 2016

@article{abc,
	author = {Yoongu Kim and Weikun Yang and Onur Mutlu},
	journal = {Computer Architecture Letters},
	title = {Ramulator: A Fast and Extensible DRAM Simulator.},
	url = {http://dx.doi.org/10.1109/LCA.2015.2414456},
	year = {2016}
}

Mitigating the Memory Bottleneck With Approximate Load Value Prediction.

Amir Yazdanbakhsh, Bradley Thwaites, Hadi Esmaeilzadeh, Gennady Pekhimenko, Onur Mutlu, and Todd C. Mowry

IEEE Design & Test, January 2016

@article{abc,
	author = {Amir Yazdanbakhsh and Bradley Thwaites and Hadi Esmaeilzadeh and Gennady Pekhimenko and Onur Mutlu and Todd C. Mowry},
	journal = {IEEE Design \& Test},
	title = {Mitigating the Memory Bottleneck With Approximate Load Value Prediction.},
	url = {http://dx.doi.org/10.1109/MDAT.2015.2504899},
	year = {2016}
}

Common Bonds: MIPS, HPS, Two-Level Branch Prediction, and Compressed Code RISC Processor.

Onur Mutlu, Richard A. Belgard, Thomas R. Gross, Norman P. Jouppi, John L. Hennessy, Steven A. Przybylski, Chris Rowen, Yale N. Patt, Wen-Mei W. Hwu, Stephen W. Melvin, Michael Shebanow, Tse-Yu Yeh, and Andy Wolfe

IEEE Micro, January 2016

@article{abc,
	author = {Onur Mutlu and Richard A. Belgard and Thomas R. Gross and Norman P. Jouppi and John L. Hennessy and Steven A. Przybylski and Chris Rowen and Yale N. Patt and Wen-Mei W. Hwu and Stephen W. Melvin and Michael Shebanow and Tse-Yu Yeh and Andy Wolfe},
	journal = {IEEE Micro},
	title = {Common Bonds: MIPS, HPS, Two-Level Branch Prediction, and Compressed Code RISC Processor.},
	url = {http://dx.doi.org/10.1109/MM.2016.66},
	year = {2016}
}

Reducing Performance Impact of DRAM Refresh by Parallelizing Refreshes with Accesses.

Kevin Kai-Wei Chang, Donghyuk Lee, Zeshan Chishti, Alaa R. Alameldeen, Chris Wilkerson, Yoongu Kim, and Onur Mutlu

CoRR, January 2016

@article{abc,
	author = {Kevin Kai-Wei Chang and Donghyuk Lee and Zeshan Chishti and Alaa R. Alameldeen and Chris Wilkerson and Yoongu Kim and Onur Mutlu},
	journal = {CoRR},
	title = {Reducing Performance Impact of DRAM Refresh by Parallelizing Refreshes with Accesses.},
	url = {http://arxiv.org/abs/1601.06352},
	year = {2016}
}

2015

The application slowdown model: quantifying and controlling the impact of inter-application interference at shared caches and main memory.

Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Manabi Khan, and Onur Mutlu

Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA, December 2015

@inproceedings{abc,
	author = {Lavanya Subramanian and Vivek Seshadri and Arnab Ghosh and Samira Manabi Khan and Onur Mutlu},
	booktitle = {Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA},
	title = {The application slowdown model: quantifying and controlling the impact of inter-application interference at shared caches and main memory.},
	url = {http://doi.acm.org/10.1145/2830772.2830803},
	year = {2015}
}

Gather-scatter DRAM: in-DRAM address translation to improve the spatial locality of non-unit strided accesses.

Vivek Seshadri, Thomas Mullins, Amirali Boroumand, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry

Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA, December 2015

@inproceedings{abc,
	author = {Vivek Seshadri and Thomas Mullins and Amirali Boroumand and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA},
	title = {Gather-scatter DRAM: in-DRAM address translation to improve the spatial locality of non-unit strided accesses.},
	url = {http://doi.acm.org/10.1145/2830772.2830820},
	year = {2015}
}

ThyNVM: enabling software-transparent crash consistency in persistent memory systems.

Jinglei Ren, Jishen Zhao, Samira Manabi Khan, Jongmoo Choi, Yongwei Wu, and Onur Mutlu

Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA, December 2015

@inproceedings{abc,
	author = {Jinglei Ren and Jishen Zhao and Samira Manabi Khan and Jongmoo Choi and Yongwei Wu and Onur Mutlu},
	booktitle = {Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA},
	title = {ThyNVM: enabling software-transparent crash consistency in persistent memory systems.},
	url = {http://doi.acm.org/10.1145/2830772.2830802},
	year = {2015}
}

Rethinking Memory System Design (along with Interconnects).

Onur Mutlu

Proceedings of the 8th International Workshop on Network on Chip Architectures, NoCArc '15, Waikiki, HI, USA, December 2015

@inproceedings{abc,
	author = {Onur Mutlu},
	booktitle = {Proceedings of the 8th International Workshop on Network on Chip Architectures, NoCArc {\textquoteright}15, Waikiki, HI, USA},
	title = {Rethinking Memory System Design (along with Interconnects).},
	url = {http://doi.acm.org/10.1145/2835512.2835520},
	year = {2015}
}

Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM.

Donghyuk Lee, Lavanya Subramanian, Rachata Ausavarungnirun, Jongmoo Choi, and Onur Mutlu

2015 International Conference on Parallel Architecture and Compilation, PACT 2015, San Francisco, CA, USA, October 2015

@inproceedings{abc,
	author = {Donghyuk Lee and Lavanya Subramanian and Rachata Ausavarungnirun and Jongmoo Choi and Onur Mutlu},
	booktitle = {2015 International Conference on Parallel Architecture and Compilation, PACT 2015, San Francisco, CA, USA},
	title = {Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM.},
	url = {http://dx.doi.org/10.1109/PACT.2015.51},
	year = {2015}
}

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance.

Rachata Ausavarungnirun, Saugata Ghose, Onur Kayiran, Gabriel H. Loh, Chita R. Das, Mahmut T. Kandemir, and Onur Mutlu

2015 International Conference on Parallel Architecture and Compilation, PACT 2015, San Francisco, CA, USA, October 2015

@inproceedings{abc,
	author = {Rachata Ausavarungnirun and Saugata Ghose and Onur Kayiran and Gabriel H. Loh and Chita R. Das and Mahmut T. Kandemir and Onur Mutlu},
	booktitle = {2015 International Conference on Parallel Architecture and Compilation, PACT 2015, San Francisco, CA, USA},
	title = {Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance.},
	url = {http://dx.doi.org/10.1109/PACT.2015.38},
	year = {2015}
}

A Low-Overhead, Fully-Distributed, Guaranteed-Delivery Routing Algorithm for Faulty Network-on-Chips.

Mohammad Fattah, Antti Airola, Rachata Ausavarungnirun, Nima Mirzaei, Pasi Liljeberg, Juha Plosila, Siamak Mohammadi, Tapio Pahikkala, Onur Mutlu, and Hannu Tenhunen

Proceedings of the 9th International Symposium on Networks-on-Chip, NOCS 2015, Vancouver, BC, Canada, September 2015

@inproceedings{abc,
	author = {Mohammad Fattah and Antti Airola and Rachata Ausavarungnirun and Nima Mirzaei and Pasi Liljeberg and Juha Plosila and Siamak Mohammadi and Tapio Pahikkala and Onur Mutlu and Hannu Tenhunen},
	booktitle = {Proceedings of the 9th International Symposium on Networks-on-Chip, NOCS 2015, Vancouver, BC, Canada},
	title = {A Low-Overhead, Fully-Distributed, Guaranteed-Delivery Routing Algorithm for Faulty Network-on-Chips.},
	url = {http://doi.acm.org/10.1145/2786572.2786591},
	year = {2015}
}

Rethinking memory system design for data-intensive computing.

Onur Mutlu

2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2015, Samos, Greece, July 2015

@inproceedings{abc,
	author = {Onur Mutlu},
	booktitle = {2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2015, Samos, Greece},
	title = {Rethinking memory system design for data-intensive computing.},
	url = {http://dx.doi.org/10.1109/SAMOS.2015.7363650},
	year = {2015}
}

A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps.

Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarungnirun, Chita R. Das, Mahmut T. Kandemir, Todd C. Mowry, and Onur Mutlu

Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 2015

Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. This paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate “assist warps” that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.

@inproceedings{abc,
	abstract = {Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. This paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate {\textquotedblleft}assist warps{\textquotedblright} that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7\% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.},
	author = {Nandita Vijaykumar and Gennady Pekhimenko and Adwait Jog and Abhishek Bhowmick and Rachata Ausavarungnirun and Chita R. Das and Mahmut T. Kandemir and Todd C. Mowry and Onur Mutlu},
	booktitle = {Proceedings of the 42nd Annual International Symposium on Computer Architecture},
	title = {A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps.},
	url = {http://doi.acm.org/10.1145/2749469.2750399},
	venue = {Portland, OR, USA},
	year = {2015}
}

PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture.

Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi

Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 2015

@inproceedings{abc,
	author = {Junwhan Ahn and Sungjoo Yoo and Onur Mutlu and Kiyoung Choi},
	booktitle = {Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA},
	title = {PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture.},
	url = {http://doi.acm.org/10.1145/2749469.2750385},
	year = {2015}
}

Page overlays: an enhanced virtual memory framework to enable fine-grained memory management.

Vivek Seshadri, Gennady Pekhimenko, Olatunji Ruwase, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry, and Trishul M. Chilimbi

Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 2015

@inproceedings{abc,
	author = {Vivek Seshadri and Gennady Pekhimenko and Olatunji Ruwase and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry and Trishul M. Chilimbi},
	booktitle = {Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA},
	title = {Page overlays: an enhanced virtual memory framework to enable fine-grained memory management.},
	url = {http://doi.acm.org/10.1145/2749469.2750379},
	year = {2015}
}

Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field.

Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu

45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil, June 2015

@inproceedings{abc,
	author = {Justin Meza and Qiang Wu and Sanjeev Kumar and Onur Mutlu},
	booktitle = {45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil},
	title = {Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field.},
	url = {http://dx.doi.org/10.1109/DSN.2015.57},
	year = {2015}
}

A scalable processing-in-memory accelerator for parallel graph processing.

Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi

Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 2015

@inproceedings{abc,
	author = {Junwhan Ahn and Sungpack Hong and Sungjoo Yoo and Onur Mutlu and Kiyoung Choi},
	booktitle = {Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA},
	title = {A scalable processing-in-memory accelerator for parallel graph processing.},
	url = {http://doi.acm.org/10.1145/2749469.2750386},
	year = {2015}
}

Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery.

Yu Cai, Yixin Luo, Saugata Ghose, and Onur Mutlu

45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil, June 2015

@inproceedings{abc,
	author = {Yu Cai and Yixin Luo and Saugata Ghose and Onur Mutlu},
	booktitle = {45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil},
	title = {Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery.},
	url = {http://dx.doi.org/10.1109/DSN.2015.49},
	year = {2015}
}

AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems.

Moinuddin K. Qureshi, Dae-Hyun Kim, Samira Manabi Khan, Prashant J. Nair, and Onur Mutlu

45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil, June 2015

@inproceedings{abc,
	author = {Moinuddin K. Qureshi and Dae-Hyun Kim and Samira Manabi Khan and Prashant J. Nair and Onur Mutlu},
	booktitle = {45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil},
	title = {AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems.},
	url = {http://dx.doi.org/10.1109/DSN.2015.58},
	year = {2015}
}

A Large-Scale Study of Flash Memory Failures in the Field.

Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu

Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Portland, OR, USA, June 2015

@inproceedings{abc,
	author = {Justin Meza and Qiang Wu and Sanjeev Kumar and Onur Mutlu},
	booktitle = {Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Portland, OR, USA},
	title = {A Large-Scale Study of Flash Memory Failures in the Field.},
	url = {http://doi.acm.org/10.1145/2745844.2745848},
	year = {2015}
}

WARM: Improving NAND flash memory lifetime with write-hotness aware retention management.

Yixin Luo, Yu Cai, Saugata Ghose, Jongmoo Choi, and Onur Mutlu

IEEE 31st Symposium on Mass Storage Systems and Technologies, MSST 2015, Santa Clara, CA, USA, May 2015

@inproceedings{abc,
	author = {Yixin Luo and Yu Cai and Saugata Ghose and Jongmoo Choi and Onur Mutlu},
	booktitle = {IEEE 31st Symposium on Mass Storage Systems and Technologies, MSST 2015, Santa Clara, CA, USA},
	title = {WARM: Improving NAND flash memory lifetime with write-hotness aware retention management.},
	url = {http://dx.doi.org/10.1109/MSST.2015.7208284},
	year = {2015}
}

Amnesic cache management for non-volatile memory.

Dongwoo Kang, Seungjae Baek, Jongmoo Choi, Donghee Lee, Sam H. Noh, and Onur Mutlu

IEEE 31st Symposium on Mass Storage Systems and Technologies, MSST 2015, Santa Clara, CA, USA, May 2015

@inproceedings{abc,
	author = {Dongwoo Kang and Seungjae Baek and Jongmoo Choi and Donghee Lee and Sam H. Noh and Onur Mutlu},
	booktitle = {IEEE 31st Symposium on Mass Storage Systems and Technologies, MSST 2015, Santa Clara, CA, USA},
	title = {Amnesic cache management for non-volatile memory.},
	url = {http://dx.doi.org/10.1109/MSST.2015.7208291},
	year = {2015}
}

A-DRM: Architecture-aware Distributed Resource Management of Virtualized Clusters.

Hui Wang, Canturk Isci, Lavanya Subramanian, Jongmoo Choi, Depei Qian, and Onur Mutlu

Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, Istanbul, Turkey, March 2015

@inproceedings{abc,
	author = {Hui Wang and Canturk Isci and Lavanya Subramanian and Jongmoo Choi and Depei Qian and Onur Mutlu},
	booktitle = {Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, Istanbul, Turkey},
	title = {A-DRM: Architecture-aware Distributed Resource Management of Virtualized Clusters.},
	url = {http://doi.acm.org/10.1145/2731186.2731202},
	year = {2015}
}

Comparative evaluation of FPGA and ASIC implementations of bufferless and buffered routing algorithms for on-chip networks.

Yu Cai, Ken Mai, and Onur Mutlu

Sixteenth International Symposium on Quality Electronic Design, ISQED 2015, Santa Clara, CA, USA, March 2015

@inproceedings{abc,
	author = {Yu Cai and Ken Mai and Onur Mutlu},
	booktitle = {Sixteenth International Symposium on Quality Electronic Design, ISQED 2015, Santa Clara, CA, USA},
	title = {Comparative evaluation of FPGA and ASIC implementations of bufferless and buffered routing algorithms for on-chip networks.},
	url = {http://dx.doi.org/10.1109/ISQED.2015.7085472},
	year = {2015}
}

Exploiting compressed block size as an indicator of future reuse.

Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry

21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA, February 2015

@inproceedings{abc,
	author = {Gennady Pekhimenko and Tyler Huberty and Rui Cai and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA},
	title = {Exploiting compressed block size as an indicator of future reuse.},
	url = {http://dx.doi.org/10.1109/HPCA.2015.7056021},
	year = {2015}
}

Data retention in MLC NAND flash memory: Characterization, optimization, and recovery.

Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu

21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA, February 2015

@inproceedings{abc,
	author = {Yu Cai and Yixin Luo and Erich F. Haratsch and Ken Mai and Onur Mutlu},
	booktitle = {21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA},
	title = {Data retention in MLC NAND flash memory: Characterization, optimization, and recovery.},
	url = {http://dx.doi.org/10.1109/HPCA.2015.7056062},
	year = {2015}
}

Adaptive-latency DRAM: Optimizing DRAM timing for the common-case.

Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Manabi Khan, Vivek Seshadri, Kevin Kai-Wei Chang, and Onur Mutlu

21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA, February 2015

@inproceedings{abc,
	author = {Donghyuk Lee and Yoongu Kim and Gennady Pekhimenko and Samira Manabi Khan and Vivek Seshadri and Kevin Kai-Wei Chang and Onur Mutlu},
	booktitle = {21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA},
	title = {Adaptive-latency DRAM: Optimizing DRAM timing for the common-case.},
	url = {http://dx.doi.org/10.1109/HPCA.2015.7056057},
	year = {2015}
}

Simultaneous Multi Layer Access: A High Bandwidth and Low Cost 3D-Stacked Memory Interface.

Donghyuk Lee, Gennady Pekhimenko, Samira Manabi Khan, Saugata Ghose, and Onur Mutlu

CoRR, January 2015

@article{abc,
	author = {Donghyuk Lee and Gennady Pekhimenko and Samira Manabi Khan and Saugata Ghose and Onur Mutlu},
	journal = {CoRR},
	title = {Simultaneous Multi Layer Access: A High Bandwidth and Low Cost 3D-Stacked Memory Interface.},
	url = {http://arxiv.org/abs/1506.03160},
	year = {2015}
}

Introducing the MICRO Test of Time Awards: Concept, Process, 2014 Winners, and the Future.

Onur Mutlu and Richard A. Belgard

IEEE Micro, January 2015

@article{abc,
	author = {Onur Mutlu and Richard A. Belgard},
	journal = {IEEE Micro},
	title = {Introducing the MICRO Test of Time Awards: Concept, Process, 2014 Winners, and the Future.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MM.2015.32},
	year = {2015}
}

Managing Hybrid Main Memories with a Page-Utility Driven Performance Model.

Yang Li, Jongmoo Choi, Jin Sun, Saugata Ghose, Hui Wang, Justin Meza, Jinglei Ren, and Onur Mutlu

CoRR, January 2015

@article{abc,
	author = {Yang Li and Jongmoo Choi and Jin Sun and Saugata Ghose and Hui Wang and Justin Meza and Jinglei Ren and Onur Mutlu},
	journal = {CoRR},
	title = {Managing Hybrid Main Memories with a Page-Utility Driven Performance Model.},
	url = {http://arxiv.org/abs/1507.03303},
	year = {2015}
}

The Blacklisting Memory Scheduler: Balancing Performance, Fairness and Complexity.

Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, and Onur Mutlu

CoRR, January 2015

@article{abc,
	author = {Lavanya Subramanian and Donghyuk Lee and Vivek Seshadri and Harsha Rastogi and Onur Mutlu},
	journal = {CoRR},
	title = {The Blacklisting Memory Scheduler: Balancing Performance, Fairness and Complexity.},
	url = {http://arxiv.org/abs/1504.00390},
	year = {2015}
}

Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping.

Hongyi Xin, John Greth, John Emmons, Gennady Pekhimenko, Carl Kingsford, Can Alkan, and Onur Mutlu

Bioinformatics, January 2015

@article{abc,
	author = {Hongyi Xin and John Greth and John Emmons and Gennady Pekhimenko and Carl Kingsford and Can Alkan and Onur Mutlu},
	journal = {Bioinformatics},
	title = {Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping.},
	url = {http://dx.doi.org/10.1093/bioinformatics/btu856},
	year = {2015}
}

High-Performance and Lightweight Transaction Support in Flash-Based SSDs.

Youyou Lu, Jiwu Shu, Jia Guo, Shuai Li, and Onur Mutlu

IEEE Trans. Computers, January 2015

@article{abc,
	author = {Youyou Lu and Jiwu Shu and Jia Guo and Shuai Li and Onur Mutlu},
	journal = {IEEE Trans. Computers},
	title = {High-Performance and Lightweight Transaction Support in Flash-Based SSDs.},
	url = {http://dx.doi.org/10.1109/TC.2015.2389828},
	year = {2015}
}

SQUASH: Simple QoS-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators.

Hiroyuki Usui, Lavanya Subramanian, Kevin Kai-Wei Chang, and Onur Mutlu

CoRR, January 2015

@article{abc,
	author = {Hiroyuki Usui and Lavanya Subramanian and Kevin Kai-Wei Chang and Onur Mutlu},
	journal = {CoRR},
	title = {SQUASH: Simple QoS-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators.},
	url = {http://arxiv.org/abs/1505.07502},
	year = {2015}
}

Toggle-Aware Compression for GPUs.

Gennady Pekhimenko, Evgeny Bolotin, Mike O'Connor, Onur Mutlu, Todd C. Mowry, and Stephen W. Keckler

Computer Architecture Letters, January 2015

@article{abc,
	author = {Gennady Pekhimenko and Evgeny Bolotin and Mike O{\textquoteright}Connor and Onur Mutlu and Todd C. Mowry and Stephen W. Keckler},
	journal = {Computer Architecture Letters},
	title = {Toggle-Aware Compression for GPUs.},
	url = {http://dx.doi.org/10.1109/LCA.2015.2430853},
	year = {2015}
}

Optimal Seed Solver: Optimizing Seed Selection in Read Mapping.

Hongyi Xin, Richard Zhu, Sunny Nahar, John Emmons, Gennady Pekhimenko, Carl Kingsford, Can Alkan, and Onur Mutlu

CoRR, January 2015

@article{abc,
	author = {Hongyi Xin and Richard Zhu and Sunny Nahar and John Emmons and Gennady Pekhimenko and Carl Kingsford and Can Alkan and Onur Mutlu},
	journal = {CoRR},
	title = {Optimal Seed Solver: Optimizing Seed Selection in Read Mapping.},
	url = {http://arxiv.org/abs/1506.08235},
	year = {2015}
}

Fast Bulk Bitwise AND and OR in DRAM.

Vivek Seshadri, Kevin Hsieh, Amirali Boroumand, Donghyuk Lee, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry

Computer Architecture Letters, January 2015

@article{abc,
	author = {Vivek Seshadri and Kevin Hsieh and Amirali Boroumand and Donghyuk Lee and Michael A. Kozuch and Onur Mutlu and Phillip B. Gibbons and Todd C. Mowry},
	journal = {Computer Architecture Letters},
	title = {Fast Bulk Bitwise AND and OR in DRAM.},
	url = {http://dx.doi.org/10.1109/LCA.2015.2434872},
	year = {2015}
}

2014

FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems.

Jishen Zhao, Onur Mutlu, and Yuan Xie

47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom, December 2014

@inproceedings{abc,
	author = {Jishen Zhao and Onur Mutlu and Yuan Xie},
	booktitle = {47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom},
	title = {FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems.},
	url = {http://dx.doi.org/10.1109/MICRO.2014.47},
	year = {2014}
}

Managing GPU Concurrency in Heterogeneous Architectures.

Onur Kayiran, Nachiappan Chidambaram Nachiappan, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, and Chita R. Das

47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom, December 2014

@inproceedings{abc,
	author = {Onur Kayiran and Nachiappan Chidambaram Nachiappan and Adwait Jog and Rachata Ausavarungnirun and Mahmut T. Kandemir and Gabriel H. Loh and Onur Mutlu and Chita R. Das},
	booktitle = {47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom},
	title = {Managing GPU Concurrency in Heterogeneous Architectures.},
	url = {http://dx.doi.org/10.1109/MICRO.2014.62},
	year = {2014}
}

Design and Evaluation of Hierarchical Rings with Deflection Routing.

Rachata Ausavarungnirun, Chris Fallin, Xiangyao Yu, Kevin Kai-Wei Chang, Greg Nazario, Reetuparna Das, Gabriel H. Loh, and Onur Mutlu

26th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2014, Paris, France, October 2014

@inproceedings{abc,
	author = {Rachata Ausavarungnirun and Chris Fallin and Xiangyao Yu and Kevin Kai-Wei Chang and Greg Nazario and Reetuparna Das and Gabriel H. Loh and Onur Mutlu},
	booktitle = {26th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2014, Paris, France},
	title = {Design and Evaluation of Hierarchical Rings with Deflection Routing.},
	url = {http://dx.doi.org/10.1109/SBAC-PAD.2014.31},
	year = {2014}
}

Loose-Ordering Consistency for persistent memory.

Youyou Lu, Jiwu Shu, Long Sun, and Onur Mutlu

32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea, October 2014

@inproceedings{abc,
	author = {Youyou Lu and Jiwu Shu and Long Sun and Onur Mutlu},
	booktitle = {32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea},
	title = {Loose-Ordering Consistency for persistent memory.},
	url = {http://dx.doi.org/10.1109/ICCD.2014.6974684},
	year = {2014}
}

The heterogeneous block architecture.

Chris Fallin, Chris Wilkerson, and Onur Mutlu

32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea, October 2014

@inproceedings{abc,
	author = {Chris Fallin and Chris Wilkerson and Onur Mutlu},
	booktitle = {32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea},
	title = {The heterogeneous block architecture.},
	url = {http://dx.doi.org/10.1109/ICCD.2014.6974710},
	year = {2014}
}

The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost.

Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, and Onur Mutlu

32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea, October 2014

@inproceedings{abc,
	author = {Lavanya Subramanian and Donghyuk Lee and Vivek Seshadri and Harsha Rastogi and Onur Mutlu},
	booktitle = {32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea},
	title = {The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost.},
	url = {http://dx.doi.org/10.1109/ICCD.2014.6974655},
	year = {2014}
}

Warp-aware trace scheduling for GPUs.

James A. Jablin, Thomas B. Jablin, Onur Mutlu, and Maurice Herlihy

International Conference on Parallel Architectures and Compilation, PACT '14, Edmonton, AB, Canada, August 2014

@inproceedings{abc,
	author = {James A. Jablin and Thomas B. Jablin and Onur Mutlu and Maurice Herlihy},
	booktitle = {International Conference on Parallel Architectures and Compilation, PACT {\textquoteright}14, Edmonton, AB, Canada},
	title = {Warp-aware trace scheduling for GPUs.},
	url = {http://doi.acm.org/10.1145/2628071.2628101},
	year = {2014}
}

Rollback-free value prediction with approximate loads.

Bradley Thwaites, Gennady Pekhimenko, Hadi Esmaeilzadeh, Amir Yazdanbakhsh, Onur Mutlu, Jongse Park, Girish Mururu, and Todd C. Mowry

International Conference on Parallel Architectures and Compilation, PACT '14, Edmonton, AB, Canada, August 2014

@inproceedings{abc,
	author = {Bradley Thwaites and Gennady Pekhimenko and Hadi Esmaeilzadeh and Amir Yazdanbakhsh and Onur Mutlu and Jongse Park and Girish Mururu and Todd C. Mowry},
	booktitle = {International Conference on Parallel Architectures and Compilation, PACT {\textquoteright}14, Edmonton, AB, Canada},
	title = {Rollback-free value prediction with approximate loads.},
	url = {http://doi.acm.org/10.1145/2628071.2628110},
	year = {2014}
}

Neighbor-cell assisted error correction for MLC NAND flash memories.

Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Osman S. Unsal, Adrián Cristal, and Ken Mai

ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '14, Austin, TX, June 2014

@inproceedings{abc,
	author = {Yu Cai and Gulay Yalcin and Onur Mutlu and Erich F. Haratsch and Osman S. Unsal and Adri{\'a}n Cristal and Ken Mai},
	booktitle = {ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS {\textquoteright}14, Austin, TX},
	title = {Neighbor-cell assisted error correction for MLC NAND flash memories.},
	url = {http://doi.acm.org/10.1145/2591971.2591994},
	year = {2014}
}

Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors.

Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji-Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu

ACM/IEEE 41st International Symposium on Computer Architecture, ISCA 2014, Minneapolis, MN, USA, June 2014

@inproceedings{abc,
	author = {Yoongu Kim and Ross Daly and Jeremie Kim and Chris Fallin and Ji-Hye Lee and Donghyuk Lee and Chris Wilkerson and Konrad Lai and Onur Mutlu},
	booktitle = {ACM/IEEE 41st International Symposium on Computer Architecture, ISCA 2014, Minneapolis, MN, USA},
	title = {Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors.},
	url = {http://dx.doi.org/10.1109/ISCA.2014.6853210},
	year = {2014}
}

The Dirty-Block Index.

Vivek Seshadri, Abhishek Bhowmick, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry

ACM/IEEE 41st International Symposium on Computer Architecture, ISCA 2014, Minneapolis, MN, USA, June 2014

@inproceedings{abc,
	author = {Vivek Seshadri and Abhishek Bhowmick and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {ACM/IEEE 41st International Symposium on Computer Architecture, ISCA 2014, Minneapolis, MN, USA},
	title = {The Dirty-Block Index.},
	url = {http://dx.doi.org/10.1109/ISCA.2014.6853204},
	year = {2014}
}

Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory.

Yixin Luo, Sriram Govindan, Bikash Sharma, Mark Santaniello, Justin Meza, Aman Kansal, Jie Liu, Badriddine M. Khessib, Kushagra Vaid, and Onur Mutlu

44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2014, Atlanta, GA, USA, June 2014

@inproceedings{abc,
	author = {Yixin Luo and Sriram Govindan and Bikash Sharma and Mark Santaniello and Justin Meza and Aman Kansal and Jie Liu and Badriddine M. Khessib and Kushagra Vaid and Onur Mutlu},
	booktitle = {44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2014, Atlanta, GA, USA},
	title = {Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory.},
	url = {http://dx.doi.org/10.1109/DSN.2014.50},
	year = {2014}
}

The efficacy of error mitigation techniques for DRAM retention failures: a comparative experimental study.

Samira Manabi Khan, Donghyuk Lee, Yoongu Kim, Alaa R. Alameldeen, Chris Wilkerson, and Onur Mutlu

ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '14, Austin, TX, June 2014

@inproceedings{abc,
	author = {Samira Manabi Khan and Donghyuk Lee and Yoongu Kim and Alaa R. Alameldeen and Chris Wilkerson and Onur Mutlu},
	booktitle = {ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS {\textquoteright}14, Austin, TX},
	title = {The efficacy of error mitigation techniques for DRAM retention failures: a comparative experimental study.},
	url = {http://doi.acm.org/10.1145/2591971.2592000},
	year = {2014}
}

Bounding memory interference delay in COTS-based multi-core systems.

Hyoseung Kim, Dionisio de Niz, Björn Andersson, Mark H. Klein, Onur Mutlu, and Ragunathan Rajkumar

20th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS 2014, Berlin, Germany, April 2014

@inproceedings{abc,
	author = {Hyoseung Kim and Dionisio de Niz and Bj{\"o}rn Andersson and Mark H. Klein and Onur Mutlu and Ragunathan Rajkumar},
	booktitle = {20th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS 2014, Berlin, Germany},
	title = {Bounding memory interference delay in COTS-based multi-core systems.},
	url = {http://dx.doi.org/10.1109/RTAS.2014.6925998},
	year = {2014}
}

Improving DRAM performance by parallelizing refreshes with accesses.

Kevin Kai-Wei Chang, Donghyuk Lee, Zeshan Chishti, Alaa R. Alameldeen, Chris Wilkerson, Yoongu Kim, and Onur Mutlu

20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014, Orlando, FL, USA, February 2014

@inproceedings{abc,
	author = {Kevin Kai-Wei Chang and Donghyuk Lee and Zeshan Chishti and Alaa R. Alameldeen and Chris Wilkerson and Yoongu Kim and Onur Mutlu},
	booktitle = {20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014, Orlando, FL, USA},
	title = {Improving DRAM performance by parallelizing refreshes with accesses.},
	url = {http://dx.doi.org/10.1109/HPCA.2014.6835946},
	year = {2014}
}

Improving cache performance using read-write partitioning.

Samira Manabi Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez

20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014, Orlando, FL, USA, February 2014

@inproceedings{abc,
	author = {Samira Manabi Khan and Alaa R. Alameldeen and Chris Wilkerson and Onur Mutlu and Daniel A. Jim{\'e}nez},
	booktitle = {20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014, Orlando, FL, USA},
	title = {Improving cache performance using read-write partitioning.},
	url = {http://dx.doi.org/10.1109/HPCA.2014.6835954},
	year = {2014}
}

Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories.

HanBin Yoon, Justin Meza, Naveen Muralimanohar, Norman P. Jouppi, and Onur Mutlu

TACO, January 2014

@article{abc,
	author = {HanBin Yoon and Justin Meza and Naveen Muralimanohar and Norman P. Jouppi and Onur Mutlu},
	journal = {TACO},
	title = {Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories.},
	url = {http://doi.acm.org/10.1145/2669365},
	year = {2014}
}

Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks.

Vivek Seshadri, Samihan Yedkar, Hongyi Xin, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry

TACO, January 2014

@article{abc,
	author = {Vivek Seshadri and Samihan Yedkar and Hongyi Xin and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	journal = {TACO},
	title = {Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks.},
	url = {http://doi.acm.org/10.1145/2677956},
	year = {2014}
}

2013

RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization.

Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry

The 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, Davis, CA, USA, December 2013

@inproceedings{abc,
	author = {Vivek Seshadri and Yoongu Kim and Chris Fallin and Donghyuk Lee and Rachata Ausavarungnirun and Gennady Pekhimenko and Yixin Luo and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {The 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, Davis, CA, USA},
	title = {RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization.},
	url = {http://doi.acm.org/10.1145/2540708.2540725},
	year = {2013}
}

Linearly compressed pages: a low-complexity, low-latency main memory compression framework.

Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry

The 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, Davis, CA, USA, December 2013

@inproceedings{abc,
	author = {Gennady Pekhimenko and Vivek Seshadri and Yoongu Kim and Hongyi Xin and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {The 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, Davis, CA, USA},
	title = {Linearly compressed pages: a low-complexity, low-latency main memory compression framework.},
	url = {http://doi.acm.org/10.1145/2540708.2540724},
	year = {2013}
}

LightTx: A lightweight transactional design in flash-based SSDs to support flexible transactions.

Youyou Lu, Jiwu Shu, Jia Guo, Shuai Li, and Onur Mutlu

2013 IEEE 31st International Conference on Computer Design, ICCD 2013, Asheville, NC, USA, October 2013

@inproceedings{abc,
	author = {Youyou Lu and Jiwu Shu and Jia Guo and Shuai Li and Onur Mutlu},
	booktitle = {2013 IEEE 31st International Conference on Computer Design, ICCD 2013, Asheville, NC, USA},
	title = {LightTx: A lightweight transactional design in flash-based SSDs to support flexible transactions.},
	url = {http://dx.doi.org/10.1109/ICCD.2013.6657033},
	year = {2013}
}

Program interference in MLC NAND flash memory: Characterization, modeling, and mitigation.

Yu Cai, Onur Mutlu, Erich F. Haratsch, and Ken Mai

2013 IEEE 31st International Conference on Computer Design, ICCD 2013, Asheville, NC, USA, October 2013

@inproceedings{abc,
	author = {Yu Cai and Onur Mutlu and Erich F. Haratsch and Ken Mai},
	booktitle = {2013 IEEE 31st International Conference on Computer Design, ICCD 2013, Asheville, NC, USA},
	title = {Program interference in MLC NAND flash memory: Characterization, modeling, and mitigation.},
	url = {http://dx.doi.org/10.1109/ICCD.2013.6657034},
	year = {2013}
}

Orchestrated scheduling and prefetching for GPGPUs.

Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das

The 40th Annual International Symposium on Computer Architecture, ISCA'13, Tel-Aviv, Israel, June 2013

@inproceedings{abc,
	author = {Adwait Jog and Onur Kayiran and Asit K. Mishra and Mahmut T. Kandemir and Onur Mutlu and Ravishankar Iyer and Chita R. Das},
	booktitle = {The 40th Annual International Symposium on Computer Architecture, ISCA{\textquoteright}13, Tel-Aviv, Israel},
	title = {Orchestrated scheduling and prefetching for GPGPUs.},
	url = {http://doi.acm.org/10.1145/2485922.2485951},
	year = {2013}
}

An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms.

Jamie Liu, Ben Jaiyen, Yoongu Kim, Chris Wilkerson, and Onur Mutlu

The 40th Annual International Symposium on Computer Architecture, ISCA'13, Tel-Aviv, Israel, June 2013

@inproceedings{abc,
	author = {Jamie Liu and Ben Jaiyen and Yoongu Kim and Chris Wilkerson and Onur Mutlu},
	booktitle = {The 40th Annual International Symposium on Computer Architecture, ISCA{\textquoteright}13, Tel-Aviv, Israel},
	title = {An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms.},
	url = {http://doi.acm.org/10.1145/2485922.2485928},
	year = {2013}
}

Utility-based acceleration of multithreaded applications on asymmetric CMPs.

José A. Joao, M. Aater Suleman, Onur Mutlu, and Yale N. Patt

The 40th Annual International Symposium on Computer Architecture, ISCA'13, Tel-Aviv, Israel, June 2013

@inproceedings{abc,
	author = {Jos{\'e} A. Joao and M. Aater Suleman and Onur Mutlu and Yale N. Patt},
	booktitle = {The 40th Annual International Symposium on Computer Architecture, ISCA{\textquoteright}13, Tel-Aviv, Israel},
	title = {Utility-based acceleration of multithreaded applications on asymmetric CMPs.},
	url = {http://doi.acm.org/10.1145/2485922.2485936},
	year = {2013}
}

A heterogeneous multiple network-on-chip design: an application-aware approach.

Asit K. Mishra, Onur Mutlu, and Chita R. Das

The 50th Annual Design Automation Conference 2013, DAC '13, Austin, TX, USA, May 2013

@inproceedings{abc,
	author = {Asit K. Mishra and Onur Mutlu and Chita R. Das},
	booktitle = {The 50th Annual Design Automation Conference 2013, DAC {\textquoteright}13, Austin, TX, USA},
	title = {A heterogeneous multiple network-on-chip design: an application-aware approach.},
	url = {http://doi.acm.org/10.1145/2463209.2488779},
	year = {2013}
}

Evaluating STT-RAM as an energy-efficient main memory alternative.

Emre Kultursay, Mahmut T. Kandemir, Anand Sivasubramaniam, and Onur Mutlu

2013 IEEE International Symposium on Performance Analysis of Systems and Software, Austin, TX, USA, April 2013

@inproceedings{abc,
	author = {Emre Kultursay and Mahmut T. Kandemir and Anand Sivasubramaniam and Onur Mutlu},
	booktitle = {2013 IEEE International Symposium on Performance Analysis of Systems and Software, Austin, TX, USA},
	title = {Evaluating STT-RAM as an energy-efficient main memory alternative.},
	url = {http://dx.doi.org/10.1109/ISPASS.2013.6557176},
	year = {2013}
}

EMERALD: Characterization of emerging applications and algorithms for low-power devices.

Chuanjun Zhang, Glenn G. Ko, Jungwook Choi, Shang-nien Tsai, Minje Kim, Abner Guzmán-Rivera, Rob A. Rutenbar, Paris Smaragdis, Mi Sun Park, Narayanan Vijaykrishnan, Hongyi Xin, Onur Mutlu, Bin Li, Li Zhao, and Mei Chen

2013 IEEE International Symposium on Performance Analysis of Systems and Software, Austin, TX, USA, April 2013

@inproceedings{abc,
	author = {Chuanjun Zhang and Glenn G. Ko and Jungwook Choi and Shang-nien Tsai and Minje Kim and Abner Guzm{\'a}n-Rivera and Rob A. Rutenbar and Paris Smaragdis and Mi Sun Park and Narayanan Vijaykrishnan and Hongyi Xin and Onur Mutlu and Bin Li and Li Zhao and Mei Chen},
	booktitle = {2013 IEEE International Symposium on Performance Analysis of Systems and Software, Austin, TX, USA},
	title = {EMERALD: Characterization of emerging applications and algorithms for low-power devices.},
	url = {http://dx.doi.org/10.1109/ISPASS.2013.6557154},
	year = {2013}
}

OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance.

Adwait Jog, Onur Kayiran, Nachiappan Chidambaram Nachiappan, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das

Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, Houston, TX, March 2013

@inproceedings{abc,
	author = {Adwait Jog and Onur Kayiran and Nachiappan Chidambaram Nachiappan and Asit K. Mishra and Mahmut T. Kandemir and Onur Mutlu and Ravishankar Iyer and Chita R. Das},
	booktitle = {Architectural Support for Programming Languages and Operating Systems, ASPLOS {\textquoteright}13, Houston, TX},
	title = {OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance.},
	url = {http://doi.acm.org/10.1145/2451116.2451158},
	year = {2013}
}

Threshold voltage distribution in MLC NAND flash memory: characterization, analysis, and modeling.

Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai

Design, Automation and Test in Europe, DATE 13, Grenoble, France, March 2013

@inproceedings{abc,
	author = {Yu Cai and Erich F. Haratsch and Onur Mutlu and Ken Mai},
	booktitle = {Design, Automation and Test in Europe, DATE 13, Grenoble, France},
	title = {Threshold voltage distribution in MLC NAND flash memory: characterization, analysis, and modeling.},
	year = {2013}
}

Application-to-core mapping policies to reduce memory system interference in multi-core systems.

Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani Azimi

19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China, February 2013

@inproceedings{abc,
	author = {Reetuparna Das and Rachata Ausavarungnirun and Onur Mutlu and Akhilesh Kumar and Mani Azimi},
	booktitle = {19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China},
	title = {Application-to-core mapping policies to reduce memory system interference in multi-core systems.},
	url = {http://dx.doi.org/10.1109/HPCA.2013.6522311},
	year = {2013}
}

Tiered-latency DRAM: A low latency and low cost DRAM architecture.

Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, and Onur Mutlu

19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China, February 2013

@inproceedings{abc,
	author = {Donghyuk Lee and Yoongu Kim and Vivek Seshadri and Jamie Liu and Lavanya Subramanian and Onur Mutlu},
	booktitle = {19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China},
	title = {Tiered-latency DRAM: A low latency and low cost DRAM architecture.},
	url = {http://dx.doi.org/10.1109/HPCA.2013.6522354},
	year = {2013}
}

MISE: Providing performance predictability and improving fairness in shared main memory systems.

Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and Onur Mutlu

19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China, February 2013

@inproceedings{abc,
	author = {Lavanya Subramanian and Vivek Seshadri and Yoongu Kim and Ben Jaiyen and Onur Mutlu},
	booktitle = {19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China},
	title = {MISE: Providing performance predictability and improving fairness in shared main memory systems.},
	url = {http://dx.doi.org/10.1109/HPCA.2013.6522356},
	year = {2013}
}

Accelerating read mapping with FastHASH.

Hongyi Xin, Donghyuk Lee, Farhad Hormozdiari, Samihan Yedkar, Onur Mutlu, and Can Alkan

BMC Genomics, January 2013

@article{abc,
	author = {Hongyi Xin and Donghyuk Lee and Farhad Hormozdiari and Samihan Yedkar and Onur Mutlu and Can Alkan},
	journal = {BMC Genomics},
	title = {Accelerating read mapping with FastHASH.},
	url = {http://dx.doi.org/10.1186/1471-2164-14-S1-S13},
	year = {2013}
}

2012

HAT: Heterogeneous Adaptive Throttling for On-Chip Networks.

Kevin Kai-Wei Chang, Rachata Ausavarungnirun, Chris Fallin, and Onur Mutlu

IEEE 24th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2012, New York, NY, USA, October 2012

@inproceedings{abc,
	author = {Kevin Kai-Wei Chang and Rachata Ausavarungnirun and Chris Fallin and Onur Mutlu},
	booktitle = {IEEE 24th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2012, New York, NY, USA},
	title = {{HAT}: Heterogeneous Adaptive Throttling for On-Chip Networks.},
	url = {http://doi.ieeecomputersociety.org/10.1109/SBAC-PAD.2012.44},
	year = {2012}
}

Flash correct-and-refresh: Retention-aware error management for increased flash memory lifetime.

Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrián Cristal, Osman S. Unsal, and Ken Mai

30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada, September 2012

@inproceedings{abc,
	author = {Yu Cai and Gulay Yalcin and Onur Mutlu and Erich F. Haratsch and Adri{\'a}n Cristal and Osman S. Unsal and Ken Mai},
	booktitle = {30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada},
	title = {Flash correct-and-refresh: Retention-aware error management for increased flash memory lifetime.},
	url = {http://doi.ieeecomputersociety.org/10.1109/ICCD.2012.6378623},
	year = {2012}
}

A case for small row buffers in non-volatile main memories.

Justin Meza, Jing Li, and Onur Mutlu

30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada, September 2012

@inproceedings{abc,
	author = {Justin Meza and Jing Li and Onur Mutlu},
	booktitle = {30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada},
	title = {A case for small row buffers in non-volatile main memories.},
	url = {http://dx.doi.org/10.1109/ICCD.2012.6378685},
	year = {2012}
}

Row buffer locality aware caching policies for hybrid memories.

HanBin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachael Harding, and Onur Mutlu

30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada, September 2012

@inproceedings{abc,
	author = {HanBin Yoon and Justin Meza and Rachata Ausavarungnirun and Rachael Harding and Onur Mutlu},
	booktitle = {30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada},
	title = {Row buffer locality aware caching policies for hybrid memories.},
	url = {http://dx.doi.org/10.1109/ICCD.2012.6378661},
	year = {2012}
}

Application-to-core mapping policies to reduce memory interference in multi-core systems.

Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani Azimi