Publications by Hasan Hassan
2018
Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA), Vienna, Austria, February 2018
Physically Unclonable Functions (PUFs) are commonly used in cryptography to identify devices based on the uniqueness of their physical microstructures. DRAM-based PUFs have numerous advantages over PUF designs that exploit alternative substrates: DRAM is a major component of many modern systems, and a DRAM-based PUF can generate many unique identifiers. However, none of the prior DRAM PUF proposals provide implementations suitable for runtime-accessible PUF evaluation on commodity DRAM devices. Prior DRAM PUFs exhibit unacceptably high latencies, especially at low temperatures (e.g., >125.8s on average for a 64KiB memory segment below 55°C), and they cause high system interference by keeping part of DRAM unavailable during PUF evaluation. In this paper, we introduce the DRAM latency PUF, a new class of fast, reliable DRAM PUFs. The key idea is to reduce DRAM read access latency below the reliable datasheet specifications using software-only system calls. Doing so results in error patterns that reflect the compound effects of manufacturing variations in various DRAM structures (e.g., capacitors, wires, sense amplifiers). Based on a rigorous experimental characterization of 223 modern LPDDR4 DRAM chips, we demonstrate that these error patterns 1) satisfy runtime-accessible PUF requirements, and 2) are quickly generated (i.e., at 88.2ms) irrespective of operating temperature using a real system with no additional hardware modifications. We show that, for a constant DRAM capacity overhead of 64KiB, our implementation of the DRAM latency PUF enables an average (minimum, maximum) PUF evaluation time speedup of 152x (109x, 181x) at 70°C and 1426x (868x, 1783x) at 55°C when compared to a DRAM retention PUF and achieves greater speedups at even lower temperatures.
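The key idea lends itself to a compact illustration. The sketch below shows the evaluation flow under stated assumptions: set_trcd() and read_raw_segment() are hypothetical stand-ins for a platform-specific way of lowering the activation-to-read delay (tRCD) below the datasheet value and reading raw data back; they are not the paper's actual interface, which requires no hardware modifications.

    /* Hedged sketch of DRAM latency PUF evaluation.
     * set_trcd() and read_raw_segment() are HYPOTHETICAL stand-ins
     * for a platform-specific mechanism that lowers tRCD below the
     * datasheet value and reads raw cell data back. */
    #include <stdint.h>
    #include <stddef.h>

    #define SEGMENT_BYTES (64 * 1024)   /* 64 KiB PUF memory segment */

    extern void set_trcd(unsigned cycles);                 /* hypothetical */
    extern void read_raw_segment(uint8_t *dst, size_t n);  /* hypothetical */

    /* Evaluate the PUF: read back a known, previously written pattern
     * with reduced tRCD and keep the error bitmap as the signature. */
    void evaluate_latency_puf(uint8_t signature[SEGMENT_BYTES])
    {
        static uint8_t readback[SEGMENT_BYTES];
        const uint8_t known = 0xAA;   /* pattern written beforehand */

        set_trcd(1);                  /* well below the reliable spec */
        read_raw_segment(readback, SEGMENT_BYTES);
        set_trcd(0);                  /* 0 = restore datasheet timing
                                         (hypothetical convention)    */

        /* Cells that fail under reduced latency flip bits; XOR with
         * the known pattern isolates those flips. Their positions
         * reflect per-chip manufacturing variation and serve as the
         * device identifier. */
        for (size_t i = 0; i < SEGMENT_BYTES; i++)
            signature[i] = readback[i] ^ known;
    }

Because the error positions are rooted in process variation, the bitmap differs across chips while remaining repeatable on the same chip, which is what lets it act as an identifier.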
@inproceedings{kim2018dramlatencypuf,
  author    = {Jeremie Kim and Minesh Patel and Hasan Hassan and Onur Mutlu},
  title     = {The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern Commodity DRAM Devices},
  booktitle = {Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA)},
  venue     = {Vienna, Austria},
  year      = {2018}
}
2017
Bioinformatics, November 2017
Motivation
High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments, called short reads, that cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and ‘candidate’ locations in that reference genome. The similarity measurement, called alignment, formulated as an approximate string matching problem, is the computational bottleneck because: (i) it is implemented using quadratic-time dynamic programming algorithms and (ii) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper’s execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment algorithms.
Results
We propose GateKeeper, a new hardware accelerator that functions as a pre-alignment step that quickly filters out most incorrect candidate locations. GateKeeper is the first design to accelerate pre-alignment using Field-Programmable Gate Arrays (FPGAs), which can perform pre-alignment much faster than software. When implemented on a single FPGA chip, GateKeeper maintains high accuracy (on average >96%) while providing, on average, 90-fold and 130-fold speedup over the state-of-the-art software pre-alignment techniques, Adjacency Filter and Shifted Hamming Distance (SHD), respectively. The addition of GateKeeper as a pre-alignment step can reduce the verification time of the mrFAST mapper by a factor of 10.
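To make the bottleneck concrete, the sketch below contrasts the two costs the abstract describes: a linear-time mismatch filter that discards hopeless candidate locations, and the quadratic-time dynamic-programming alignment that only surviving candidates reach. It is a simplified software analogue of the pre-alignment idea (substitutions only), not GateKeeper's FPGA design; the function names and the 128 bp read-length bound are illustrative assumptions.

    #include <stdbool.h>

    /* Linear-time pre-alignment filter: count mismatches and bail out
     * once the edit threshold e is exceeded. (Substitutions only; a
     * real filter must also tolerate indels, e.g., by checking
     * shifted copies of the read as SHD and GateKeeper do.) */
    static bool passes_filter(const char *read, const char *cand,
                              int len, int e)
    {
        int mismatches = 0;
        for (int i = 0; i < len; i++)
            if (read[i] != cand[i] && ++mismatches > e)
                return false;   /* reject without running alignment */
        return true;
    }

    /* Quadratic-time edit distance: the expensive verification step
     * that should run only for candidates that pass the filter. */
    static int edit_distance(const char *a, const char *b, int n)
    {
        int dp[128 + 1][128 + 1];   /* assumes reads up to 128 bp */
        for (int i = 0; i <= n; i++) dp[i][0] = i;
        for (int j = 0; j <= n; j++) dp[0][j] = j;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++) {
                int sub = dp[i-1][j-1] + (a[i-1] != b[j-1]);
                int del = dp[i-1][j] + 1;
                int ins = dp[i][j-1] + 1;
                int m = sub < del ? sub : del;
                dp[i][j] = m < ins ? m : ins;
            }
        return dp[n][n];
    }

Because most candidate locations are highly dissimilar to the read, the O(n) filter rejects the bulk of them before the O(n²) alignment ever runs, which is the source of the reported speedups.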
@article{alser2017gatekeeper,
  author  = {Mohammed Alser and Hasan Hassan and Hongyi Xin and Oguz Ergin and Onur Mutlu and Can Alkan},
  title   = {GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping},
  journal = {Bioinformatics},
  volume  = {33},
  pages   = {3355-3363},
  year    = {2017}
}
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, MA, USA, October 2017
Many important applications trigger bulk bitwise operations, i.e., bitwise operations on large bit vectors. In fact, recent works design techniques that exploit fast bulk bitwise operations to accelerate databases (bitmap indices, BitWeaving) and web search (BitFunnel). Unfortunately, in existing architectures, the throughput of bulk bitwise operations is limited by the memory bandwidth available to the processing unit (e.g., CPU, GPU, FPGA, processing-in-memory).
To overcome this bottleneck, we propose Ambit, an Accelerator-in-Memory for bulk bitwise operations. Unlike prior works, Ambit exploits the analog operation of DRAM technology to perform bitwise operations completely inside DRAM, thereby exploiting the full internal DRAM bandwidth. Ambit consists of two components. First, simultaneous activation of three DRAM rows that share the same set of sense amplifiers enables the system to perform bitwise AND and OR operations. Second, with modest changes to the sense amplifier, the system can use the inverters present inside the sense amplifier to perform bitwise NOT operations. With these two components, Ambit can perform any bulk bitwise operation efficiently inside DRAM. Ambit largely exploits existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1% of DRAM chip area). Importantly, Ambit uses the modern DRAM interface without any changes, and therefore it can be directly plugged onto the memory bus.
Our extensive circuit simulations show that Ambit works as expected even in the presence of significant process variation. Averaged across seven bulk bitwise operations, Ambit improves performance by 32X and reduces energy consumption by 35X compared to state-of-the-art systems. When integrated with Hybrid Memory Cube (HMC), a 3D-stacked DRAM with a logic layer, Ambit improves performance of bulk bitwise operations by 9.7X compared to processing in the logic layer of the HMC. Ambit improves the performance of three real-world data-intensive applications, 1) database bitmap indices, 2) BitWeaving, a technique to accelerate database scans, and 3) bit-vector-based implementation of sets, by 3X-7X compared to a state-of-the-art baseline using SIMD optimizations. We describe four other applications that can benefit from Ambit, including a recent technique proposed to speed up web search. We believe that large performance and energy improvements provided by Ambit can enable other applications to use bulk bitwise operations.
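The AND/OR mechanism reduces to bitwise majority, which is easy to check in a few lines. The sketch below is a functional model, not a circuit simulation: activating three rows that share sense amplifiers resolves each bitline to MAJ(a, b, c) = ab + bc + ca, so presetting the third row to all 0s yields AND, presetting it to all 1s yields OR, and the inverted value already latched in the sense amplifier provides NOT.

    #include <stdint.h>
    #include <stdio.h>

    /* Per-bit majority of three row copies (the effect of
     * simultaneously activating three DRAM rows). */
    static uint64_t maj(uint64_t a, uint64_t b, uint64_t c)
    {
        return (a & b) | (b & c) | (c & a);
    }

    int main(void)
    {
        uint64_t a = 0xC;      /* row A: 1100 */
        uint64_t b = 0xA;      /* row B: 1010 */
        uint64_t mask = 0xF;   /* show 4 bits for readability */

        uint64_t and_ab = maj(a, b, 0);       /* control row preset to 0s */
        uint64_t or_ab  = maj(a, b, ~0ULL);   /* control row preset to 1s */
        uint64_t not_a  = ~a & mask;          /* sense-amplifier inverter */

        printf("A AND B = %llx\n", (unsigned long long)(and_ab & mask)); /* 8 */
        printf("A OR  B = %llx\n", (unsigned long long)(or_ab  & mask)); /* e */
        printf("NOT A   = %llx\n", (unsigned long long)not_a);           /* 3 */
        return 0;
    }

With AND, OR, and NOT available, any bulk bitwise function can be composed, which is why the paper's two mechanisms suffice.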
@inproceedings{seshadri2017ambit,
  author    = {Vivek Seshadri and Donghyuk Lee and Thomas Mullins and Hasan Hassan and Amirali Boroumand and Jeremie Kim and Michael A. Kozuch and Onur Mutlu and Phillip B. Gibbons and Todd C. Mowry},
  title     = {Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology},
  booktitle = {Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture},
  venue     = {Cambridge, MA, USA},
  year      = {2017}
}
Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, Urbana-Champaign, IL, USA, June 2017
@inproceedings{chang2017reducedvoltage,
  author    = {Kevin K. Chang and Abdullah Giray Yağlıkçı and Saugata Ghose and Aditya Agrawal and Niladrish Chatterjee and Abhijith Kashyap and Donghyuk Lee and Mike O'Connor and Hasan Hassan and Onur Mutlu},
  title     = {Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms},
  booktitle = {Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems},
  venue     = {Urbana-Champaign, IL, USA},
  url       = {http://doi.acm.org/10.1145/3078505.3078590},
  year      = {2017}
}
2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA, February 2017
@inproceedings{hassan2017softmc,
  author    = {Hasan Hassan and Nandita Vijaykumar and Samira Manabi Khan and Saugata Ghose and Kevin K. Chang and Gennady Pekhimenko and Donghyuk Lee and Oguz Ergin and Onur Mutlu},
  title     = {SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies},
  booktitle = {2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)},
  venue     = {Austin, TX, USA},
  url       = {https://doi.org/10.1109/HPCA.2017.62},
  year      = {2017}
}
Computer Architecture Letters, January 2017
@article{boroumand2017lazypim,
  author  = {Amirali Boroumand and Saugata Ghose and Minesh Patel and Hasan Hassan and Brandon Lucia and Kevin Hsieh and Krishna T. Malladi and Hongzhong Zheng and Onur Mutlu},
  title   = {LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory},
  journal = {Computer Architecture Letters},
  url     = {https://doi.org/10.1109/LCA.2016.2577557},
  year    = {2017}
}
CoRR, January 2017
@article{boroumand2017lazypim-arxiv,
  author  = {Amirali Boroumand and Saugata Ghose and Minesh Patel and Hasan Hassan and Brandon Lucia and Nastaran Hajinazar and Kevin Hsieh and Krishna T. Malladi and Hongzhong Zheng and Onur Mutlu},
  title   = {LazyPIM: Efficient Support for Cache Coherence in Processing-in-Memory Architectures},
  journal = {CoRR},
  url     = {http://arxiv.org/abs/1706.03162},
  year    = {2017}
}
CoRR, January 2017
@article{chang2017reducedvoltage-arxiv,
  author  = {Kevin K. Chang and Abdullah Giray Yağlıkçı and Saugata Ghose and Aditya Agrawal and Niladrish Chatterjee and Abhijith Kashyap and Donghyuk Lee and Mike O'Connor and Hasan Hassan and Onur Mutlu},
  title   = {Understanding Reduced-Voltage Operation in Modern DRAM Chips: Characterization, Analysis, and Mechanisms},
  journal = {CoRR},
  url     = {http://arxiv.org/abs/1705.10292},
  year    = {2017}
}
POMACS, January 2017
@article{chang2017reducedvoltage-pomacs,
  author  = {Kevin K. Chang and Abdullah Giray Yağlıkçı and Saugata Ghose and Aditya Agrawal and Niladrish Chatterjee and Abhijith Kashyap and Donghyuk Lee and Mike O'Connor and Hasan Hassan and Onur Mutlu},
  title   = {Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms},
  journal = {POMACS},
  url     = {http://doi.acm.org/10.1145/3084447},
  year    = {2017}
}
2016
Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Antibes Juan-Les-Pins, France, June 2016
@inproceedings{chang2016latencyvariation,
  author    = {Kevin K. Chang and Abhijith Kashyap and Hasan Hassan and Saugata Ghose and Kevin Hsieh and Donghyuk Lee and Tianshi Li and Gennady Pekhimenko and Samira Manabi Khan and Onur Mutlu},
  title     = {Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization},
  booktitle = {Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems},
  venue     = {Antibes Juan-Les-Pins, France},
  url       = {http://doi.acm.org/10.1145/2896377.2901453},
  year      = {2016}
}
2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 2016
@inproceedings{hassan2016chargecache,
  author    = {Hasan Hassan and Gennady Pekhimenko and Nandita Vijaykumar and Vivek Seshadri and Donghyuk Lee and Oguz Ergin and Onur Mutlu},
  title     = {ChargeCache: Reducing DRAM latency by exploiting row access locality},
  booktitle = {2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)},
  venue     = {Barcelona, Spain},
  url       = {http://dx.doi.org/10.1109/HPCA.2016.7446096},
  year      = {2016}
}
CoRR, January 2016
@article{alser2016gatekeeper-arxiv,
  author  = {Mohammed Alser and Hasan Hassan and Hongyi Xin and Oguz Ergin and Onur Mutlu and Can Alkan},
  title   = {GateKeeper: Enabling Fast Pre-Alignment in DNA Short Read Mapping with a New Streaming Accelerator Architecture},
  journal = {CoRR},
  url     = {http://arxiv.org/abs/1604.01789},
  year    = {2016}
}
CoRR, January 2016
@article{seshadri2016buddyram,
  author  = {Vivek Seshadri and Donghyuk Lee and Thomas Mullins and Hasan Hassan and Amirali Boroumand and Jeremie Kim and Michael A. Kozuch and Onur Mutlu and Phillip B. Gibbons and Todd C. Mowry},
  title   = {Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM},
  journal = {CoRR},
  url     = {http://arxiv.org/abs/1611.09988},
  year    = {2016}
}