Publications

2020

Proceedings of the 2020 International Conference on Management of Data (SIGMOD'20), Portland, OR, USA, June 2020
Business Rule Management Systems (BRMSs) are widely used in industry for a variety of tasks. Their main advantage is to codify in a succinct and queryable manner vast amounts of constantly evolving logic. In BRMSs, rules are typically captured as facts (tuples) over a collection of criteria, and checking them involves querying the collection of rules to find the best match. In this paper, we focus on a real-world use case from the airline industry: determining the minimum connection time (MCT) between flights. The MCT module is part of the flight search engine, and captures the ever changing constraints at each airport that determine the time to allocate between an arriving and a departing flight for a connection to be feasible. We explore how to use hardware acceleration to (i) improve the performance of the MCT module (lower latency, higher throughput); and (ii) reduce the amount of computing resources needed. A key aspect of the solution is the transformation of a collection of rules into a Non-deterministic Finite state Automaton efficiently implemented on FPGA. Experiments performed on-premises and in the cloud show several orders of magnitude improvement over the existing solution, and the potential to reduce by 40% the number of machines needed for the flight search engine.
@inproceedings{abc,
	abstract = {Business Rule Management Systems (BRMSs) are widely used in industry for a variety of tasks. Their main advantage is to codify in a succinct and queryable manner vast amounts of constantly evolving logic. In BRMSs, rules are typically captured as facts (tuples) over a collection of criteria, and checking them involves querying the collection of rules to find the best match. In this paper, we focus on a real-world use case from the airline industry: determining the minimum connection time (MCT) between flights. The MCT module is part of the flight search engine, and captures the ever changing constraints at each airport that determine the time to allocate between an arriving and a departing flight for a connection to be feasible. We explore how to use hardware acceleration to (i) improve the performance of the MCT module (lower latency, higher throughput); and (ii) reduce the amount of computing resources needed. A key aspect of the solution is the transformation of a collection of rules into a Non-deterministic Finite state Automaton efficiently implemented on FPGA. Experiments performed on-premises and in the cloud show several orders of magnitude improvement over the existing solution, and the potential to reduce by 40\% the number of machines needed for the flight search engine.},
	author = {Fabio Maschi and Muhsen Owaida and Gustavo Alonso and Matteo Casalino and Anthony Hock-Koon},
	booktitle = {Proceedings of the 2020 International Conference on Management of Data (SIGMOD{\textquoteright}20)},
	title = {Making Search Engines Faster by Lowering the Cost of Querying Business Rules Through FPGAs},
	url = {https://doi.org/10.1145/3318464.3386133},
	venue = {Portland, OR, USA},
	year = {2020}
}

2019

Proceedings of the VLDB 2019, Los Angeles, CA, USA, August 2019
We design and implement Megaphone, a data migration mechanism for stateful distributed dataflow engines with latency objectives. When compared to existing migration mechanisms, Megaphone has the following differentiating characteristics: (i) migrations can be subdivided to a configurable granularity to avoid latency spikes, and (ii) migrations can be prepared ahead of time to avoid runtime coordination. Megaphone is implemented as a library on an unmodified timely dataflow implementation, and provides an operator interface compatible with its existing APIs. We evaluate Megaphone on established benchmarks with varying amounts of state and observe that compared to naïve approaches Megaphone reduces service latencies during reconfiguration by orders of magnitude without significantly increasing steady-state overhead.
@inproceedings{abc,
	abstract = {We design and implement Megaphone, a data migration mechanism
for stateful distributed dataflow engines with latency objectives.
When compared to existing migration mechanisms, Megaphone has
the following differentiating characteristics: (i) migrations can be
subdivided to a configurable granularity to avoid latency spikes, and
(ii) migrations can be prepared ahead of time to avoid runtime coordination.
Megaphone is implemented as a library on an unmodified
timely dataflow implementation, and provides an operator interface
compatible with its existing APIs. We evaluate Megaphone on established
benchmarks with varying amounts of state and observe that
compared to na{\"\i}ve approaches Megaphone reduces service latencies
during reconfiguration by orders of magnitude without significantly
increasing steady-state overhead.},
	author = {Moritz Hoffmann and Andrea Lattuada and Frank McSherry},
	booktitle = {Proceedings of the VLDB 2019},
	title = {Megaphone: Latency-conscious state migration for distributed streaming dataflows},
	venue = {Los Angeles, CA, USA},
	year = {2019}
}
Proceedings of BIRTE Workshop 2019, Los Angeles, CA, USA, August 2019
We explore the performance and resource trade-offs of two alternative approaches to streaming state management. When the state size exceeds the amount of available memory, systems can either scale out and partition the state across distributed computing nodes or rely on secondary storage and divide the state into ‘hot’ and ‘cold’ sets. Scaling out a streaming computation might introduce coordination overhead among parallel workers, while flushing state to disk requires efficient data structures and careful caching policies to minimise expensive I/O. To study the characteristics of these state management approaches, we present an integration of the Timely Dataflow stream processing engine with the FASTER embedded key-value store. We demonstrate a prototype that allows users to transparently maintain arbitrary larger-than-memory state with low overhead by making only minimal changes to application code. Our preliminary experimental results show that managed state incurs acceptable overhead over built-in in-memory data structures and, in some cases, performs better when relying on secondary storage in a
@inproceedings{abc,
	abstract = {We explore the performance and resource trade-offs of two alternative
approaches to streaming state management. When
the state size exceeds the amount of available memory, systems
can either scale out and partition the state across distributed
computing nodes or rely on secondary storage and
divide the state into {\textquoteleft}hot{\textquoteright} and {\textquoteleft}cold{\textquoteright} sets. Scaling out a streaming
computation might introduce coordination overhead
among parallel workers, while flushing state to disk requires
efficient data structures and careful caching policies to minimise
expensive I/O.
To study the characteristics of these state management approaches,
we present an integration of the Timely Dataflow
stream processing engine with the FASTER embedded key-value
store. We demonstrate a prototype that allows users to
transparently maintain arbitrary larger-than-memory state
with low overhead by making only minimal changes to application
code. Our preliminary experimental results show
that managed state incurs acceptable overhead over built-in
in-memory data structures and, in some cases, performs better
when relying on secondary storage in a},
	author = {Matthew Brookes and Vasiliki Kalavri and John Liagouris},
	booktitle = {Proceedings of BIRTE Workshop 2019},
	title = {FASTER State Management for Timely Dataflow},
	venue = {Los Angeles, CA, USA},
	year = {2019}
}
Proceedings of the VLDB 2019, Los Angeles, CA, USA, August 2019
Learning from the data stored in a database is an important function increasingly available in relational engines. Methods using lower precision input data are of special interest given their overall higher efficiency. However, in databases, these methods have a hidden cost: the quantization of the real value into a smaller number is an expensive step. To address this issue, we present MLWeaving, a data structure and hardware acceleration technique intended to speed up learning of generalized linear models over low precision data. MLWeaving provides a compact in-memory representation that enables the retrieval of data at any level of precision. MLWeaving also provides a highly efficient implementation of stochastic gradient descent on FPGAs and enables the dynamic tuning of precision, instead of using a fixed precision level during learning. Experimental results show that MLWeaving converges up to 16× faster than low-precision implementations of first-order methods on CPUs.
@inproceedings{abc,
	abstract = {Learning from the data stored in a database is an important function
increasingly available in relational engines. Methods using
lower precision input data are of special interest given their overall
higher efficiency. However, in databases, these methods have a
hidden cost: the quantization of the real value into a smaller number
is an expensive step. To address this issue, we present MLWeaving,
a data structure and hardware acceleration technique intended
to speed up learning of generalized linear models over low
precision data. MLWeaving provides a compact in-memory representation
that enables the retrieval of data at any level of precision.
MLWeaving also provides a highly efficient implementation
of stochastic gradient descent on FPGAs and enables the dynamic
tuning of precision, instead of using a fixed precision level during
learning. Experimental results show that MLWeaving converges
up to 16{\texttimes} faster than low-precision implementations of first-order
methods on CPUs.},
	author = {Zeke Wang and Kaan Kara and Gustavo Alonso and Onur Mutlu and Ce Zhang},
	booktitle = {Proceedings of the VLDB 2019},
	title = {Accelerating Generalized Linear Models with MLWeaving: A One-Size-Fits-All System for Any-precision Learning},
	url = {https://dl.acm.org/doi/10.14778/3317315.3317322},
	venue = {Los Angeles, CA, USA},
	year = {2019}
}
Proceedings of the VLDB 2019, Los Angeles, CA, USA, August 2019
The ability to perform machine learning (ML) tasks in a database management system (DBMS) provides the data analyst with a powerful tool. Unfortunately, integration of ML into a DBMS is challenging for reasons varying from differences in execution model to data layout requirements. In this paper, we assume a column-store main-memory DBMS, optimized for online analytical processing, as our initial system. On this system, we explore the integration of coordinate-descent based methods working natively on columnar format to train generalized linear models. We use a cache-efficient, partitioned stochastic coordinate descent algorithm providing linear throughput scalability with the number of cores while preserving convergence quality, up to 14 cores in our experiments. Existing column oriented DBMS rely on compression and even encryption to store data in memory. When those features are considered, the performance of a CPU based solution suffers. Thus, in the paper we also show how to exploit hardware acceleration as part of a hybrid CPU+FPGA system to provide on-the-fly data transformation combined with an FPGA-based coordinate-descent engine. The resulting system is a column-store DBMS with its important features preserved (e.g., data compression) that offers high performance machine learning capabilities.
@inproceedings{abc,
	abstract = {The ability to perform machine learning (ML) tasks in a database
management system (DBMS) provides the data analyst with a powerful
tool. Unfortunately, integration of ML into a DBMS is challenging
for reasons varying from differences in execution model to
data layout requirements. In this paper, we assume a column-store
main-memory DBMS, optimized for online analytical processing,
as our initial system. On this system, we explore the integration of
coordinate-descent based methods working natively on columnar
format to train generalized linear models. We use a cache-efficient,
partitioned stochastic coordinate descent algorithm providing linear
throughput scalability with the number of cores while preserving
convergence quality, up to 14 cores in our experiments.
Existing column oriented DBMS rely on compression and even
encryption to store data in memory. When those features are considered,
the performance of a CPU based solution suffers. Thus,
in the paper we also show how to exploit hardware acceleration
as part of a hybrid CPU+FPGA system to provide on-the-fly data
transformation combined with an FPGA-based coordinate-descent
engine. The resulting system is a column-store DBMS with its important
features preserved (e.g., data compression) that offers high
performance machine learning capabilities.},
	author = {Kaan Kara and Ken Eguro and Ce Zhang and Gustavo Alonso},
	booktitle = {Proceedings of the VLDB 2019},
	title = {ColumnML: Column Store Machine Learning with On-the-Fly Data Transformation},
	venue = {Los Angeles, CA, USA},
	year = {2019}
}
Proceedings of HotCloud 2019, Washington, US, July 2019
Cloud providers and their tenants have a mutual interest in identifying optimal configurations in which to run tenant jobs, i.e., ones that achieve tenants' performance goals at minimum cost; or ones that maximize performance within a specified budget. However, different tenants may have different performance goals that are opaque to the provider. A consequence of this opacity is that providers today typically offer fixed bundles of cloud resources, which tenants must themselves explore and choose from. This is burdensome for tenants and can lead to choices that are sub-optimal for both parties. We thus explore a simple, minimal interface, which lets tenants communicate their happiness with cloud infrastructure to the provider, and enables the provider to explore resource configurations that maximize this happiness. Our early results indicate that this interface could strike a good balance between enabling efficient discovery of application resource needs and the complexity of communicating a full description of tenant utility from different configurations to the provider.
@inproceedings{abc,
	abstract = {Cloud providers and their tenants have a mutual interest in identifying optimal configurations in which to run tenant jobs, i.e., ones that achieve tenants{\textquoteright} performance goals at minimum cost; or ones that maximize performance within a specified budget. However, different tenants may have different performance goals that are opaque to the provider. A consequence of this opacity is that providers today typically offer fixed bundles of cloud resources, which tenants must themselves explore and choose from. This is burdensome for tenants and can lead to choices that are sub-optimal for both parties.

We thus explore a simple, minimal interface, which lets tenants communicate their happiness with cloud infrastructure to the provider, and enables the provider to explore resource configurations that maximize this happiness. Our early results indicate that this interface could strike a good balance between enabling efficient discovery of application resource needs and the complexity of communicating a full description of tenant utility from different configurations to the provider.
},
	author = {Vojislav Dukic and Ankit Singla},
	booktitle = {Proceedings of HotCloud 2019},
	title = {Happiness index: Right-sizing the cloud{\textquoteright}s tenant-provider interface},
	url = {https://www.usenix.org/conference/hotcloud19/presentation/dukic},
	venue = {Washington, US},
	year = {2019}
}
Proceedings of the NAACL-HLT 2019, Minneapolis, USA, June 2019
Previous research shows that eye-tracking data contains information about the lexical and syntactic properties of text, which can be used to improve natural language processing models. In this work, we leverage eye movement features from three corpora with recorded gaze information to augment a state-of-the-art neural model for named entity recognition (NER) with gaze embeddings. These corpora were manually annotated with named entity labels. Moreover, we show how gaze features, generalized on word type level, eliminate the need for recorded eye-tracking data at test time. The gaze-augmented models for NER using token-level and type-level features outperform the baselines. We present the benefits of eye-tracking features by evaluating the NER models on both individual datasets as well as in cross-domain settings.
@inproceedings{abc,
	abstract = {Previous research shows that eye-tracking data
contains information about the lexical and syntactic
properties of text, which can be used to
improve natural language processing models.
In this work, we leverage eye movement features
from three corpora with recorded gaze
information to augment a state-of-the-art neural
model for named entity recognition (NER)
with gaze embeddings. These corpora were
manually annotated with named entity labels.
Moreover, we show how gaze features, generalized
on word type level, eliminate the need
for recorded eye-tracking data at test time. The
gaze-augmented models for NER using token-level
and type-level features outperform the
baselines. We present the benefits of eye-tracking
features by evaluating the NER models
on both individual datasets as well as in
cross-domain settings.},
	author = {Nora Hollenstein and Ce Zhang},
	booktitle = {Proceedings of the NAACL-HLT 2019},
	title = {Entity Recognition at First Sight: Improving NER with Eye Movement Information},
	venue = {Minneapolis, USA},
	year = {2019}
}
Springer Quantum Machine Intelligence, May 2019
Bayesian methods in machine learning, such as Gaussian processes, have great advantages compared to other techniques. In particular, they provide estimates of the uncertainty associated with a prediction. Extending the Bayesian approach to deep architectures has remained a major challenge. Recent results connected deep feedforward neural networks with Gaussian processes, allowing training without backpropagation. This connection enables us to leverage a quantum algorithm designed for Gaussian processes and develop a new algorithm for Bayesian deep learning on quantum computers. The properties of the kernel matrix in the Gaussian process ensure the efficient execution of the core component of the protocol, quantum matrix inversion, providing at least a polynomial speedup over classical algorithms. Furthermore, we demonstrate the execution of the algorithm on contemporary quantum computers and analyze its robustness with respect to realistic noise models.
@article{abc,
	abstract = {Bayesian methods in machine learning, such as Gaussian processes, have great advantages compared to other techniques. In particular, they provide estimates of the uncertainty associated with a prediction. Extending the Bayesian approach to deep architectures has remained a major challenge. Recent results connected deep feedforward neural networks with Gaussian processes, allowing training without backpropagation. This connection enables us to leverage a quantum algorithm designed for Gaussian processes and develop a new algorithm for Bayesian deep learning on quantum computers. The properties of the kernel matrix in the Gaussian process ensure the efficient execution of the core component of the protocol, quantum matrix inversion, providing at least a polynomial speedup over classical algorithms. Furthermore, we demonstrate the execution of the algorithm on contemporary quantum computers and analyze its robustness with respect to realistic noise models.},
	author = {Zhikuan Zhao and Alejandro Pozas-Kerstjens and Patrick Rebentrost and Peter Wittek},
	pages = {1-11},
	journal = {Springer Quantum Machine Intelligence},
	title = {Bayesian deep learning on a quantum computer},
	url = {https://link.springer.com/article/10.1007/s42484-019-00004-7},
	year = {2019}
}
Proceedings of the SysML'19, Stanford, CA, USA, April 2019
Continuous integration is an indispensable step of modern software engineering practices to systematically manage the life cycles of system development. Developing a machine learning model is no different: it is an engineering process with a life cycle, including design, implementation, tuning, testing, and deployment. However, most, if not all, existing continuous integration engines do not support machine learning as first-class citizens. In this paper, we present ease.ml/ci, to the best of our knowledge the first continuous integration system for machine learning. The challenge of building ease.ml/ci is to provide rigorous guarantees, e.g., single accuracy point error tolerance with 0.999 reliability, with a practical amount of labeling effort, e.g., 2K labels per test. We design a domain specific language that allows users to specify integration conditions with reliability constraints, and develop simple novel optimizations that can lower the number of labels required by up to two orders of magnitude for test conditions popularly used in real production systems.
@inproceedings{abc,
	abstract = {Continuous integration is an indispensable step of modern software engineering practices to systematically
manage the life cycles of system development. Developing a machine learning model is no different: it is an
engineering process with a life cycle, including design, implementation, tuning, testing, and deployment. However,
most, if not all, existing continuous integration engines do not support machine learning as first-class citizens.
In this paper, we present ease.ml/ci, to our best knowledge, the first continuous integration system for machine
learning. The challenge of building ease.ml/ci is to provide rigorous guarantees, e.g., single accuracy point
error tolerance with 0.999 reliability, with a practical amount of labeling effort, e.g., 2K labels per test. We design
a domain specific language that allows users to specify integration conditions with reliability constraints, and
develop simple novel optimizations that can lower the number of labels required by up to two orders of magnitude
for test conditions popularly used in real production systems.},
	author = {Cedric Renggli and Bojan Karlas and Bolin Ding and Feng Liu and Kevin Schawinski and Wentao Wu and Ce Zhang},
	booktitle = {Proceedings of the SysML{\textquoteright}19},
	title = {Continuous Integration of Machine Learning Models: A Rigorous Yet Practical Treatment},
	venue = {Stanford, CA, USA},
	year = {2019}
}

2018

Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp’s aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.
@inproceedings{abc,
	abstract = {Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache.
In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp{\textquoteright}s aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8{\texttimes} larger capacity and improving overall GPU performance by 31\% while reducing register file power consumption by 46\%.},
	author = {Mohammad Sadrosadati and Amirhossein Mirhosseini and Seyed B. Ehsani and Hamid Sarbazi-Azad and Mario Drumond and Babak Falsafi and Rachata Ausavarungnirun and Onur Mutlu},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching},
	url = {https://dl.acm.org/citation.cfm?id=3173211},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}
Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
Resource management strategies for many-core systems need to enable sharing of resources such as power, processing cores, and memory bandwidth while coordinating the priority and significance of system- and application-level objectives at runtime in a scalable and robust manner. State-of-the-art approaches use heuristics or machine learning for resource management, but unfortunately lack formalism in providing robustness against unexpected corner cases. While recent efforts deploy classical control-theoretic approaches with some guarantees and formalism, they lack scalability and autonomy to meet changing runtime goals. We present SPECTR, a new resource management approach for many-core systems that leverages formal supervisory control theory (SCT) to combine the strengths of classical control theory with state-of-the-art heuristic approaches to efficiently meet changing runtime goals. SPECTR is a scalable and robust control architecture and a systematic design flow for hierarchical control of many-core systems. SPECTR leverages SCT techniques such as gain scheduling to allow autonomy for individual controllers. It facilitates automatic synthesis of the high-level supervisory controller and its property verification. We implement SPECTR on an Exynos platform containing ARM’s big.LITTLE-based heterogeneous multi-processor (HMP) and demonstrate that SPECTR’s use of SCT is key to managing multiple interacting resources (e.g., chip power and processing cores) in the presence of competing objectives (e.g., satisfying QoS vs. power capping). The principles of SPECTR are easily applicable to any resource type and objective as long as the management problem can be modeled using dynamical systems theory (e.g., difference equations), discrete-event dynamic systems, or fuzzy dynamics.
@inproceedings{abc,
	abstract = {Resource management strategies for many-core systems need to enable sharing of resources such as power, processing cores, and memory bandwidth while coordinating the priority and significance of system- and application-level objectives at runtime in a scalable and robust manner. State-of-the-art approaches use heuristics or machine learning for resource management, but unfortunately lack formalism in providing robustness against unexpected corner cases. While recent efforts deploy classical control-theoretic approaches with some guarantees and formalism, they lack scalability and autonomy to meet changing runtime goals. We present SPECTR, a new resource management approach for many-core systems that leverages formal supervisory control theory (SCT) to combine the strengths of classical control theory with state-of-the-art heuristic approaches to efficiently meet changing runtime goals. SPECTR is a scalable and robust control architecture and a systematic design flow for hierarchical control of many-core systems. SPECTR leverages SCT techniques such as gain scheduling to allow autonomy for individual controllers. It facilitates automatic synthesis of the high-level supervisory controller and its property verification. We implement SPECTR on an Exynos platform containing ARM{\textquoteright}s big.LITTLE-based heterogeneous multi-processor (HMP) and demonstrate that SPECTR{\textquoteright}s use of SCT is key to managing multiple interacting resources (e.g., chip power and processing cores) in the presence of competing objectives (e.g., satisfying QoS vs. power capping). The principles of SPECTR are easily applicable to any resource type and objective as long as the management problem can be modeled using dynamical systems theory (e.g., difference equations), discrete-event dynamic systems, or fuzzy dynamics.},
	author = {Amir M. Rahmani and Bryan Donyanavard and Tiago Mück and Kasra Moazzemi and Axel Jantsch and Onur Mutlu and Nikil Dutt},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {SPECTR: Formal Supervisory Control and Coordination for Many-core Systems Resource Management},
	url = {https://dl.acm.org/citation.cfm?id=3173199},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}
Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
We are experiencing an explosive growth in the number of consumer devices, including smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. For this class of devices, energy efficiency is a first-class concern due to the limited battery capacity and thermal power budget. We find that data movement is a major contributor to the total system energy and execution time in consumer devices. The energy and performance costs of moving data between the memory system and the compute units are significantly higher than the costs of computation. As a result, addressing data movement is crucial for consumer devices. In this work, we comprehensively analyze the energy and performance impact of data movement for several widely-used Google consumer workloads: (1) the Chrome web browser; (2) TensorFlow Mobile, Google’s machine learning framework; (3) video playback, and (4) video capture, both of which are used in many video services such as YouTube and Google Hangouts. We find that processing-in-memory (PIM) can significantly reduce data movement for all of these workloads, by performing part of the computation close to memory. Each workload contains simple primitives and functions that contribute to a significant amount of the overall data movement. We investigate whether these primitives and functions are feasible to implement using PIM, given the limited area and power constraints of consumer devices. Our analysis shows that offloading these primitives to PIM logic, consisting of either simple cores or specialized accelerators, eliminates a large amount of data movement, and significantly reduces total system energy (by an average of 55.4% across the workloads) and execution time (by an average of 54.2%).
@inproceedings{abc,
	abstract = {We are experiencing an explosive growth in the number of consumer devices, including smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. For this class of devices, energy efficiency is a first-class concern due to the limited battery capacity and thermal power budget. We find that data movement is a major contributor to the total system energy and execution time in consumer devices. The energy and performance costs of moving data between the memory system and the compute units are significantly higher than the costs of computation. As a result, addressing data movement is crucial for consumer devices. In this work, we comprehensively analyze the energy and performance impact of data movement for several widely-used Google consumer workloads: (1) the Chrome web browser; (2) TensorFlow Mobile, Google{\textquoteright}s machine learning framework; (3) video playback, and (4) video capture, both of which are used in many video services such as YouTube and Google Hangouts. We find that processing-in-memory (PIM) can significantly reduce data movement for all of these workloads, by performing part of the computation close to memory. Each workload contains simple primitives and functions that contribute to a significant amount of the overall data movement. We investigate whether these primitives and functions are feasible to implement using PIM, given the limited area and power constraints of consumer devices. Our analysis shows that offloading these primitives to PIM logic, consisting of either simple cores or specialized accelerators, eliminates a large amount of data movement, and significantly reduces total system energy (by an average of 55.4\% across the workloads) and execution time (by an average of 54.2\%).},
	author = {Amirali Boroumand and Saugata Ghose and Youngsok Kim and Rachata Ausavarungnirun and Eric Shiu and Rahul Thakur and Dae-Hyun Kim and Aki Kuusela and Allan Knies and Parthasarathy Ranganathan and Onur Mutlu},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}
Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
Emerging chips with hundreds and thousands of cores require networks with unprecedented energy/area efficiency and scalability. To address this, we propose Slim NoC (SN): a new on-chip network design that delivers significant improvements in efficiency and scalability compared to the state-of-the-art. The key idea is to use two concepts from graph and number theory, degree-diameter graphs combined with non-prime finite fields, to enable the smallest number of ports for a given core count. SN is inspired by state-of-the-art off-chip topologies; it identifies and distills their advantages for NoC settings while solving several key issues that lead to significant overheads on-chip. SN provides NoC-specific layouts, which further enhance area/energy efficiency. We show how to augment SN with state-of-the-art router microarchitecture schemes such as Elastic Links, to make the network even more scalable and efficient. Our extensive experimental evaluations show that SN outperforms both traditional low-radix topologies (e.g., meshes and tori) and modern high-radix networks (e.g., various Flattened Butterflies) in area, latency, throughput, and static/dynamic power consumption for both synthetic and real workloads. SN provides a promising direction in scalable and energy-efficient NoC topologies.
@inproceedings{abc,
	abstract = {Emerging chips with hundreds and thousands of cores require networks with unprecedented energy/area efficiency and scalability. To address this, we propose Slim NoC (SN): a new on-chip network design that delivers significant improvements in efficiency and scalability compared to the state-of-the-art. The key idea is to use two concepts from graph and number theory, degree-diameter graphs combined with non-prime finite fields, to enable the smallest number of ports for a given core count. SN is inspired by state-of-the-art off-chip topologies; it identifies and distills their advantages for NoC settings while solving several key issues that lead to significant overheads on-chip. SN provides NoC-specific layouts, which further enhance area/energy efficiency. We show how to augment SN with state-of-the-art router microarchitecture schemes such as Elastic Links, to make the network even more scalable and efficient. Our extensive experimental evaluations show that SN outperforms both traditional low-radix topologies (e.g., meshes and tori) and modern high-radix networks (e.g., various Flattened Butterflies) in area, latency, throughput, and static/dynamic power consumption for both synthetic and real workloads. SN provides a promising direction in scalable and energy-efficient NoC topologies.},
	author = {Maciej Besta and Syed M. Hassan and Sudhakar Yalamanchili and Rachata Ausavarungnirun and Onur Mutlu and Torsten Hoefler},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {Slim NoC: A Low-Diameter On-Chip Network Topology for High Energy Efficiency and Scalability},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}
Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, have made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate. We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern.
Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to largely reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8%, improves IPC throughput by 43.4%, and reduces application-level unfairness by 22.4%. MASK’s system throughput is within 23.2% of an ideal GPU system with no address translation overhead.
@inproceedings{abc,
	abstract = {Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, have made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate. We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern.
Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to largely reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8\%, improves IPC throughput by 43.4\%, and reduces application-level unfairness by 22.4\%. MASK{\textquoteright}s system throughput is within 23.2\% of an ideal GPU system with no address translation overhead.},
	author = {Rachata Ausavarungnirun and Vance Miller and Joshua Landgraf and Saugata Ghose and Jayneel Gandhi and Adwait Jog and Christopher Rossbach and Onur Mutlu},
	booktitle = {Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
	title = {MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency},
	venue = {Williamsburg, VA, USA},
	year = {2018}
}
Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA), Vienna, Austria, February 2018
NAND flash memory density continues to scale to keep up with the increasing storage demands of data-intensive applications. Unfortunately, as a result of this scaling, the lifetime of NAND flash memory has been decreasing. Each cell in NAND flash memory can endure only a limited number of writes, due to the damage caused by each program and erase operation on the cell. This damage can be partially repaired on its own during the idle time between program or erase operations (known as the dwell time), via a phenomenon known as the self-recovery effect. Prior works study the self-recovery effect for planar (i.e., 2D) NAND flash memory, and propose to exploit it to improve flash lifetime, by applying high temperature to accelerate self-recovery. However, these findings may not be directly applicable to 3D NAND flash memory, due to significant changes in the design and manufacturing process that are required to enable practical 3D stacking for NAND flash memory. In this paper, we perform the first detailed experimental characterization of the effects of self-recovery and temperature on real, state-of-the-art 3D NAND flash memory devices. We show that these effects influence two major factors of NAND flash memory reliability: (1) retention loss speed (i.e., the speed at which a flash cell leaks charge), and (2) program variation (i.e., the difference in programming speed across flash cells). We find that self-recovery and temperature affect 3D NAND flash memory quite differently than they affect planar NAND flash memory, rendering prior models of self-recovery and temperature ineffective for 3D NAND flash memory. Using our characterization results, we develop a new model for 3D NAND flash memory reliability, which predicts how retention, wearout, self-recovery, and temperature affect raw bit error rates and cell threshold voltages. We show that our model is accurate, with an error of only 4.9%.
Based on our experimental findings and our model, we propose HeatWatch, a new mechanism to improve 3D NAND flash memory reliability. The key idea of HeatWatch is to optimize the read reference voltage, i.e., the voltage applied to the cell during a read operation, by adapting it to the dwell time of the workload and the current operating temperature. HeatWatch (1) efficiently tracks flash memory temperature and dwell time online, (2) sends this information to our reliability model to predict the current voltages of flash cells, and (3) predicts the optimal read reference voltage based on the current cell voltages. Our detailed experimental evaluations show that HeatWatch improves flash lifetime by 3.85× over a baseline that uses a fixed read reference voltage, averaged across 28 real storage workload traces, and comes within 0.9% of the lifetime of an ideal read reference voltage selection mechanism.
@inproceedings{abc,
	abstract = {NAND flash memory density continues to scale to keep up with the increasing storage demands of data-intensive applications. Unfortunately, as a result of this scaling, the lifetime of NAND flash memory has been decreasing. Each cell in NAND flash memory can endure only a limited number of writes, due to the damage caused by each program and erase operation on the cell. This damage can be partially repaired on its own during the idle time between program or erase operations (known as the dwell time), via a phenomenon known as the self-recovery effect. Prior works study the self-recovery effect for planar (i.e., 2D) NAND flash memory, and propose to exploit it to improve flash lifetime, by applying high temperature to accelerate self-recovery. However, these findings may not be directly applicable to 3D NAND flash memory, due to significant changes in the design and manufacturing process that are required to enable practical 3D stacking for NAND flash memory. In this paper, we perform the first detailed experimental characterization of the effects of self-recovery and temperature on real, state-of-the-art 3D NAND flash memory devices. We show that these effects influence two major factors of NAND flash memory reliability: (1) retention loss speed (i.e., the speed at which a flash cell leaks charge), and (2) program variation (i.e., the difference in programming speed across flash cells). We find that self-recovery and temperature affect 3D NAND flash memory quite differently than they affect planar NAND flash memory, rendering prior models of self-recovery and temperature ineffective for 3D NAND flash memory. Using our characterization results, we develop a new model for 3D NAND flash memory reliability, which predicts how retention, wearout, self-recovery, and temperature affect raw bit error rates and cell threshold voltages. We show that our model is accurate, with an error of only 4.9\%.
Based on our experimental findings and our model, we propose HeatWatch, a new mechanism to improve 3D NAND flash memory reliability. The key idea of HeatWatch is to optimize the read reference voltage, i.e., the voltage applied to the cell during a read operation, by adapting it to the dwell time of the workload and the current operating temperature. HeatWatch (1) efficiently tracks flash memory temperature and dwell time online, (2) sends this information to our reliability model to predict the current voltages of flash cells, and (3) predicts the optimal read reference voltage based on the current cell voltages. Our detailed experimental evaluations show that HeatWatch improves flash lifetime by 3.85{\texttimes} over a baseline that uses a fixed read reference voltage, averaged across 28 real storage workload traces, and comes within 0.9\% of the lifetime of an ideal read reference voltage selection mechanism.},
	author = {Yixin Luo and Saugata Ghose and Yu Cai and Erich F. Haratsch and Onur Mutlu},
	booktitle = {Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA)},
	title = {HeatWatch: Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature-Awareness},
	venue = {Vienna, Austria},
	year = {2018}
}
Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA), Vienna, Austria, February 2018
Physically Unclonable Functions (PUFs) are commonly used in cryptography to identify devices based on the uniqueness of their physical microstructures. DRAM-based PUFs have numerous advantages over PUF designs that exploit alternative substrates: DRAM is a major component of many modern systems, and a DRAM-based PUF can generate many unique identifiers. However, none of the prior DRAM PUF proposals provide implementations suitable for runtime-accessible PUF evaluation on commodity DRAM devices. Prior DRAM PUFs exhibit unacceptably high latencies, especially at low temperatures (e.g., >125.8s on average for a 64KiB memory segment below 55°C), and they cause high system interference by keeping part of DRAM unavailable during PUF evaluation. In this paper, we introduce the DRAM latency PUF, a new class of fast, reliable DRAM PUFs. The key idea is to reduce DRAM read access latency below the reliable datasheet specifications using software-only system calls. Doing so results in error patterns that reflect the compound effects of manufacturing variations in various DRAM structures (e.g., capacitors, wires, sense amplifiers). Based on a rigorous experimental characterization of 223 modern LPDDR4 DRAM chips, we demonstrate that these error patterns 1) satisfy runtime-accessible PUF requirements, and 2) are quickly generated (i.e., at 88.2ms) irrespective of operating temperature using a real system with no additional hardware modifications. We show that, for a constant DRAM capacity overhead of 64KiB, our implementation of the DRAM latency PUF enables an average (minimum, maximum) PUF evaluation time speedup of 152x (109x, 181x) at 70°C and 1426x (868x, 1783x) at 55°C when compared to a DRAM retention PUF and achieves greater speedups at even lower temperatures.
@inproceedings{abc,
	abstract = {Physically Unclonable Functions (PUFs) are commonly used in cryptography to identify devices based on the uniqueness of their physical microstructures. DRAM-based PUFs have numerous advantages over PUF designs that exploit alternative substrates: DRAM is a major component of many modern systems, and a DRAM-based PUF can generate many unique identifiers. However, none of the prior DRAM PUF proposals provide implementations suitable for runtime-accessible PUF evaluation on commodity DRAM devices. Prior DRAM PUFs exhibit unacceptably high latencies, especially at low temperatures (e.g., >125.8s on average for a 64KiB memory segment below 55{\textdegree}C), and they cause high system interference by keeping part of DRAM unavailable during PUF evaluation. In this paper, we introduce the DRAM latency PUF, a new class of fast, reliable DRAM PUFs. The key idea is to reduce DRAM read access latency below the reliable datasheet specifications using software-only system calls. Doing so results in error patterns that reflect the compound effects of manufacturing variations in various DRAM structures (e.g., capacitors, wires, sense amplifiers). Based on a rigorous experimental characterization of 223 modern LPDDR4 DRAM chips, we demonstrate that these error patterns 1) satisfy runtime-accessible PUF requirements, and 2) are quickly generated (i.e., at 88.2ms) irrespective of operating temperature using a real system with no additional hardware modifications. We show that, for a constant DRAM capacity overhead of 64KiB, our implementation of the DRAM latency PUF enables an average (minimum, maximum) PUF evaluation time speedup of 152x (109x, 181x) at 70{\textdegree}C and 1426x (868x, 1783x) at 55{\textdegree}C when compared to a DRAM retention PUF and achieves greater speedups at even lower temperatures.},
	author = {Jeremie Kim and Minesh Patel and Hasan Hassan and Onur Mutlu},
	booktitle = {Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA)},
	title = {The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Mod...},
	venue = {Vienna, Austria},
	year = {2018}
}
Proceedings of the 16th USENIX Conference on File and Storage Technologies, Oakland, CA, USA, February 2018
Solid-state drives (SSDs) are used in a wide array of computer systems today, including in datacenters and enterprise servers. As the I/O demands of these systems continue to increase, manufacturers are evolving SSD architectures to keep up with this demand. For example, manufacturers have introduced new high-bandwidth interfaces to replace the conventional SATA host-interface protocol. These new interfaces, such as the NVMe protocol, are designed specifically to enable the high amounts of concurrent I/O bandwidth that SSDs are capable of delivering. While modern SSDs with sophisticated features such as the NVMe protocol are already on the market, existing SSD simulation tools have fallen behind, as they do not capture these new features. We find that state-of-the-art SSD simulators have three shortcomings that prevent them from accurately modeling the performance of real off-the-shelf SSDs. First, these simulators do not model critical features of new protocols (e.g., NVMe), such as their use of multiple application-level queues for requests and the elimination of OS intervention for I/O request processing. Second, these simulators often do not accurately capture the impact of advanced SSD maintenance algorithms (e.g., garbage collection), as they do not properly or quickly emulate steady-state conditions that can significantly change the behavior of these algorithms in real SSDs. Third, these simulators do not capture the full end-to-end latency of I/O requests, which can incorrectly skew the results reported for SSDs that make use of emerging non-volatile memory technologies. By not accurately modeling these three features, existing simulators report results that deviate significantly from real SSD performance. In this work, we introduce a new simulator, called MQSim, that accurately models the performance of both modern SSDs and conventional SATA-based SSDs. 
MQSim faithfully models new high-bandwidth protocol implementations, steady-state SSD conditions, and the full end-to-end latency of requests in modern SSDs. We validate MQSim, showing that it reports performance results that are only 6%-18% apart from the measured actual performance of four real state-of-the-art SSDs. We show that by modeling critical features of modern SSDs, MQSim uncovers several real and important issues that were not captured by existing simulators, such as the performance impact of inter-flow interference. We have released MQSim as an open-source tool, and we hope that it can enable researchers to explore directions in new and different areas.
@inproceedings{abc,
	abstract = {Solid-state drives (SSDs) are used in a wide array of computer systems today, including in datacenters and enterprise servers. As the I/O demands of these systems continue to increase, manufacturers are evolving SSD architectures to keep up with this demand. For example, manufacturers have introduced new high-bandwidth interfaces to replace the conventional SATA host-interface protocol. These new interfaces, such as the NVMe protocol, are designed specifically to enable the high amounts of concurrent I/O bandwidth that SSDs are capable of delivering.

While modern SSDs with sophisticated features such as the NVMe protocol are already on the market, existing SSD simulation tools have fallen behind, as they do not capture these new features. We find that state-of-the-art SSD simulators have three shortcomings that prevent them from accurately modeling the performance of real off-the-shelf SSDs. First, these simulators do not model critical features of new protocols (e.g., NVMe), such as their use of multiple application-level queues for requests and the elimination of OS intervention for I/O request processing. Second, these simulators often do not accurately capture the impact of advanced SSD maintenance algorithms (e.g., garbage collection), as they do not properly or quickly emulate steady-state conditions that can significantly change the behavior of these algorithms in real SSDs. Third, these simulators do not capture the full end-to-end latency of I/O requests, which can incorrectly skew the results reported for SSDs that make use of emerging non-volatile memory technologies. By not accurately modeling these three features, existing simulators report results that deviate significantly from real SSD performance.

In this work, we introduce a new simulator, called MQSim, that accurately models the performance of both modern SSDs and conventional SATA-based SSDs. MQSim faithfully models new high-bandwidth protocol implementations, steady-state SSD conditions, and the full end-to-end latency of requests in modern SSDs. We validate MQSim, showing that it reports performance results that are only 6\%-18\% apart from the measured actual performance of four real state-of-the-art SSDs. We show that by modeling critical features of modern SSDs, MQSim uncovers several real and important issues that were not captured by existing simulators, such as the performance impact of inter-flow interference. We have released MQSim as an open-source tool, and we hope that it can enable researchers to explore directions in new and different areas.},
	author = {Arash Tavakkol and Juan Gomez-Luna and Mohammad Sadrosadati and Saugata Ghose and Onur Mutlu},
	booktitle = {Proceedings of the 16th USENIX Conference on File and Storage Technologies},
	title = {MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices},
	venue = {Oakland, CA, USA},
	year = {2018}
}

2017

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, CO, USA, November 2017
Betweenness centrality (BC) is a crucial graph problem that measures the significance of a vertex by the number of shortest paths leading through it. We propose Maximal Frontier Betweenness Centrality (MFBC): a succinct BC algorithm based on novel sparse matrix multiplication routines that performs a factor of p^(1/3) less communication on p processors than the best known alternatives, for graphs with n vertices and average degree k = n/p^(2/3). We formulate, implement, and prove the correctness of MFBC for weighted graphs by leveraging monoids instead of semirings, which enables a surprisingly succinct formulation. MFBC scales well for both extremely sparse and relatively dense graphs. It automatically searches a space of distributed data decompositions and sparse matrix multiplication algorithms for the most advantageous configuration. The MFBC implementation outperforms the well-known CombBLAS library by up to 8x and shows more robust performance. Our design methodology is readily extensible to other graph problems.
@inproceedings{abc,
	abstract = {Betweenness centrality (BC) is a crucial graph problem that measures the significance of a vertex by the number of shortest paths leading through it. We propose Maximal Frontier Betweenness Centrality (MFBC): a succinct BC algorithm based on novel sparse matrix multiplication routines that performs a factor of $p^{1/3}$ less communication on $p$ processors than the best known alternatives, for graphs with $n$ vertices and average degree $k = n/p^{2/3}$. We formulate, implement, and prove the correctness of MFBC for weighted graphs by leveraging monoids instead of semirings, which enables a surprisingly succinct formulation. MFBC scales well for both extremely sparse and relatively dense graphs. It automatically searches a space of distributed data decompositions and sparse matrix multiplication algorithms for the most advantageous configuration. The MFBC implementation outperforms the well-known CombBLAS library by up to 8x and shows more robust performance. Our design methodology is readily extensible to other graph problems.},
	author = {Edgar Solomonik and Maciej Besta and Flavio Vella and Torsten Hoefler},
	booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
	title = {Scaling Betweenness Centrality using Communication-Efficient Sparse Matrix Multiplication},
	venue = {Denver, CO, USA},
	year = {2017}
}
Bioinformatics, November 2017
Motivation: High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments -called short reads- that cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and ‘candidate’ locations in that reference genome. The similarity measurement, called alignment, formulated as an approximate string matching problem, is the computational bottleneck because: (i) it is implemented using quadratic-time dynamic programming algorithms and (ii) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper’s execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment algorithms. Results: We propose GateKeeper, a new hardware accelerator that functions as a pre-alignment step that quickly filters out most incorrect candidate locations. GateKeeper is the first design to accelerate pre-alignment using Field-Programmable Gate Arrays (FPGAs), which can perform pre-alignment much faster than software. When implemented on a single FPGA chip, GateKeeper maintains high accuracy (on average >96%) while providing, on average, 90-fold and 130-fold speedup over the state-of-the-art software pre-alignment techniques, Adjacency Filter and Shifted Hamming Distance (SHD), respectively. The addition of GateKeeper as a pre-alignment step can reduce the verification time of the mrFAST mapper by a factor of 10.
@article{abc,
	abstract = {Motivation
High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments -called short reads- that cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and {\textquoteleft}candidate{\textquoteright} locations in that reference genome. The similarity measurement, called alignment, formulated as an approximate string matching problem, is the computational bottleneck because: (i) it is implemented using quadratic-time dynamic programming algorithms and (ii) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper{\textquoteright}s execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment algorithms.

Results
We propose GateKeeper, a new hardware accelerator that functions as a pre-alignment step that quickly filters out most incorrect candidate locations. GateKeeper is the first design to accelerate pre-alignment using Field-Programmable Gate Arrays (FPGAs), which can perform pre-alignment much faster than software. When implemented on a single FPGA chip, GateKeeper maintains high accuracy (on average >96\%) while providing, on average, 90-fold and 130-fold speedup over the state-of-the-art software pre-alignment techniques, Adjacency Filter and Shifted Hamming Distance (SHD), respectively. The addition of GateKeeper as a pre-alignment step can reduce the verification time of the mrFAST mapper by a factor of 10.},
	author = {Mohammed Alser and Hasan Hassan and Hongyi Xin and Oguz Ergin and Onur Mutlu and Can Alkan},
	pages = {3355-3363},
	journal = {Bioinformatics},
	title = {GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping},
	volume = {33},
	year = {2017}
}
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, CO, USA, November 2017
Optimizing communication performance is imperative for large-scale computing because communication overheads limit the strong scalability of parallel applications. Today's network cards contain rather powerful processors optimized for data movement. However, these devices are limited to fixed functions, such as remote direct memory access. We develop sPIN, a portable programming model to offload simple packet processing functions to the network card. To demonstrate the potential of the model, we design a cycle-accurate simulation environment by combining the network simulator LogGOPSim and the CPU simulator gem5. We implement offloaded message matching, datatype processing, and collective communications and demonstrate transparent full-application speedups. Furthermore, we show how sPIN can be used to accelerate redundant in-memory filesystems and several other use cases. Our work investigates a portable packet-processing network acceleration model similar to compute acceleration with CUDA or OpenCL. We show how such network acceleration enables an eco-system that can significantly speed up applications and system services.
@inproceedings{abc,
	abstract = {Optimizing communication performance is imperative for large-scale computing because communication overheads limit the strong scalability of parallel applications. Today{\textquoteright}s network cards contain rather powerful processors optimized for data movement. However, these devices are limited to fixed functions, such as remote direct memory access. We develop sPIN, a portable programming model to offload simple packet processing functions to the network card. To demonstrate the potential of the model, we design a cycle-accurate simulation environment by combining the network simulator LogGOPSim and the CPU simulator gem5. We implement offloaded message matching, datatype processing, and collective communications and demonstrate transparent full-application speedups. Furthermore, we show how sPIN can be used to accelerate redundant in-memory filesystems and several other use cases. Our work investigates a portable packet-processing network acceleration model similar to compute acceleration with CUDA or OpenCL. We show how such network acceleration enables an eco-system that can significantly speed up applications and system services.},
	author = {Torsten Hoefler and Salvatore Di Girolamo and Konstantin Taranov and Ron Brightwell},
	booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
	title = {sPIN: High-performance streaming Processing in the Network},
	venue = {Denver, CO, USA},
	year = {2017}
}
Proceedings of the 27th International Conference on Field Programmable Logic and Applications (FPL), Ghent, Belgium, October 2017
Decision tree ensembles are commonly used in a wide range of applications and becoming the de facto algorithm for decision tree based classifiers. Different trees in an ensemble can be processed in parallel during tree inference, making them a suitable use case for FPGAs. Large tree ensembles, however, require careful mapping of trees to on-chip memory and management of memory accesses. As a result, existing FPGA solutions suffer from the inability to scale beyond tens of trees and lack the flexibility to support different tree ensembles. In this paper we present an FPGA tree ensemble classifier together with a software driver to efficiently manage the FPGA's memory resources. The classifier architecture efficiently utilizes the FPGA's resources to fit half a million tree nodes in on-chip memory, delivering up to 20× speedup over a 10-threaded CPU implementation when fully processing the tree ensemble on the FPGA. It can also combine the CPU and FPGA to scale to tree ensembles that do not fit in on-chip memory, achieving up to an order of magnitude speedup compared to a pure CPU implementation. In addition, the classifier architecture can be programmed at runtime to process varying tree ensemble sizes.
@inproceedings{abc,
	abstract = {Decision tree ensembles are commonly used in a wide range of applications and becoming the de facto algorithm for decision tree based classifiers. Different trees in an ensemble can be processed in parallel during tree inference, making them a suitable use case for FPGAs. Large tree ensembles, however, require careful mapping of trees to on-chip memory and management of memory accesses. As a result, existing FPGA solutions suffer from the inability to scale beyond tens of trees and lack the flexibility to support different tree ensembles. In this paper we present an FPGA tree ensemble classifier together with a software driver to efficiently manage the FPGA{\textquoteright}s memory resources. The classifier architecture efficiently utilizes the FPGA{\textquoteright}s resources to fit half a million tree nodes in on-chip memory, delivering up to 20{\texttimes} speedup over a 10-threaded CPU implementation when fully processing the tree ensemble on the FPGA. It can also combine the CPU and FPGA to scale to tree ensembles that do not fit in on-chip memory, achieving up to an order of magnitude speedup compared to a pure CPU implementation. In addition, the classifier architecture can be programmed at runtime to process varying tree ensemble sizes.},
	author = {Muhsen Owaida and Hantian Zhang and Ce Zhang and Gustavo Alonso},
	booktitle = {Proceedings of the 27th International Conference on Field Programmable Logic and Applications (FPL)},
	title = {Scalable inference of decision tree ensembles: Flexible design for CPU-FPGA platforms},
	venue = {Ghent, Belgium},
	year = {2017}
}
Dagstuhl Reports, October 2017
A number of physical limitations mandate radical changes in the way how we build computing hard- and software, and there is broad consensus that a stronger interaction between hard- and software communities is needed to meet the ever-growing demand for application performance. Under this motivation, representatives from various hard- and software communities have met at the Dagstuhl seminar "Databases on Future Hardware" to discuss the implications in the context of database systems. The outcome of the seminar was not only a much better understanding of each other's needs, constraints, and ways of thinking. Very importantly, the group identified topic areas that seem key for the ongoing shift, together with suggestions on how the field could move forward. During the seminar, it turned out that the future of databases is not only a question of technology. Rather, economic considerations have to be taken into account when building next-generation database engines.
@article{abc,
	abstract = {A number of physical limitations mandate radical changes in the way how we build computing hard- and software, and there is broad consensus that a stronger interaction between hard- and software communities is needed to meet the ever-growing demand for application performance. Under this motivation, representatives from various hard- and software communities have met at the Dagstuhl seminar "Databases on Future Hardware" to discuss the implications in the context of database systems. The outcome of the seminar was not only a much better understanding of each other{\textquoteright}s needs, constraints, and ways of thinking. Very importantly, the group identified topic areas that seem key for the ongoing shift, together with suggestions on how the field could move forward. During the seminar, it turned out that the future of databases is not only a question of technology. Rather, economic considerations have to be taken into account when building next-generation database engines.},
	author = {Gustavo Alonso and Michaela Blott and Jens Teubner},
	pages = {1-18},
	journal = {Dagstuhl Reports},
	title = {Databases on Future Hardware},
	volume = {7},
	year = {2017}
}
IEEE Transactions on Parallel and Distributed Systems, October 2017
The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.
@article{abc,
	abstract = {The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.},
	author = {Didem Unat and Anshu Dubey and Torsten Hoefler and John Shalf and Mark Abraham and Mauro Bianco and Bradford L. Chamberlain and Romain Cledat and H. Carter Edwards and Hal Finkel and Karl Fuerlinger and Frank Hannig and Emmanuel Jeannot and Amir Kamil and Jeff Keasler and Paul H. J. Kelly and Vitus Leung and Hatem Ltaief and Naoya Maruyama and Chris J. Newburn and Miquel Pericas},
	pages = {3007-3020},
	journal = {IEEE Transactions on Parallel and Distributed Systems},
	title = {Trends in Data Locality Abstractions for HPC Systems},
	volume = {28},
	year = {2017}
}
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, MA, USA, October 2017
Contemporary discrete GPUs support rich memory management features such as virtual memory and demand paging. These features simplify GPU programming by providing a virtual address space abstraction similar to CPUs and eliminating manual memory management, but they introduce high performance overheads during (1) address translation and (2) page faults. A GPU relies on high degrees of thread-level parallelism (TLP) to hide memory latency. Address translation can undermine TLP, as a single miss in the translation lookaside buffer (TLB) invokes an expensive serialized page table walk that often stalls multiple threads. Demand paging can also undermine TLP, as multiple threads often stall while they wait for an expensive data transfer over the system I/O (e.g., PCIe) bus when the GPU demands a page. In modern GPUs, we face a trade-off on how the page size used for memory management affects address translation and demand paging. The address translation overhead is lower when we employ a larger page size (e.g., 2MB large pages, compared with conventional 4KB base pages), which increases TLB coverage and thus reduces TLB misses. Conversely, the demand paging overhead is lower when we employ a smaller page size, which decreases the system I/O bus transfer latency. Support for multiple page sizes can help relax the page size trade-off so that address translation and demand paging optimizations work together synergistically. However, existing page coalescing (i.e., merging base pages into a large page) and splintering (i.e., splitting a large page into base pages) policies require costly base page migrations that undermine the benefits multiple page sizes provide. In this paper, we observe that GPGPU applications present an opportunity to support multiple page sizes without costly data migration, as the applications perform most of their memory allocation en masse (i.e., they allocate a large number of base pages at once). 
We show that this en masse allocation allows us to create intelligent memory allocation policies which ensure that base pages that are contiguous in virtual memory are allocated to contiguous physical memory pages. As a result, coalescing and splintering operations no longer need to migrate base pages. We introduce Mosaic, a GPU memory manager that provides application-transparent support for multiple page sizes. Mosaic uses base pages to transfer data over the system I/O bus, and allocates physical memory in a way that (1) preserves base page contiguity and (2) ensures that a large page frame contains pages from only a single memory protection domain. We take advantage of this allocation strategy to design a novel in-place page size selection mechanism that avoids data migration. This mechanism allows the TLB to use large pages, reducing address translation overhead. During data transfer, this mechanism enables the GPU to transfer only the base pages that are needed by the application over the system I/O bus, keeping demand paging overhead low. Our evaluations show that Mosaic reduces address translation overheads while efficiently achieving the benefits of demand paging, compared to a contemporary GPU that uses only a 4KB page size. Relative to a state-of-the-art GPU memory manager, Mosaic improves the performance of homogeneous and heterogeneous multi-application workloads by 55.5% and 29.7% on average, respectively, coming within 6.8% and 15.4% of the performance of an ideal TLB where all TLB requests are hits.
@inproceedings{abc,
	abstract = {Contemporary discrete GPUs support rich memory management features such as virtual memory and demand paging. These features simplify GPU programming by providing a virtual address space abstraction similar to CPUs and eliminating manual memory management, but they introduce high performance overheads during (1) address translation and (2) page faults. A GPU relies on high degrees of thread-level parallelism (TLP) to hide memory latency. Address translation can undermine TLP, as a single miss in the translation lookaside buffer (TLB) invokes an expensive serialized page table walk that often stalls multiple threads. Demand paging can also undermine TLP, as multiple threads often stall while they wait for an expensive data transfer over the system I/O (e.g., PCIe) bus when the GPU demands a page.

In modern GPUs, we face a trade-off on how the page size used for memory management affects address translation and demand paging. The address translation overhead is lower when we employ a larger page size (e.g., 2MB large pages, compared with conventional 4KB base pages), which increases TLB coverage and thus reduces TLB misses. Conversely, the demand paging overhead is lower when we employ a smaller page size, which decreases the system I/O bus transfer latency. Support for multiple page sizes can help relax the page size trade-off so that address translation and demand paging optimizations work together synergistically. However, existing page coalescing (i.e., merging base pages into a large page) and splintering (i.e., splitting a large page into base pages) policies require costly base page migrations that undermine the benefits multiple page sizes provide. In this paper, we observe that GPGPU applications present an opportunity to support multiple page sizes without costly data migration, as the applications perform most of their memory allocation en masse (i.e., they allocate a large number of base pages at once). We show that this en masse allocation allows us to create intelligent memory allocation policies which ensure that base pages that are contiguous in virtual memory are allocated to contiguous physical memory pages. As a result, coalescing and splintering operations no longer need to migrate base pages.

We introduce Mosaic, a GPU memory manager that provides application-transparent support for multiple page sizes. Mosaic uses base pages to transfer data over the system I/O bus, and allocates physical memory in a way that (1) preserves base page contiguity and (2) ensures that a large page frame contains pages from only a single memory protection domain. We take advantage of this allocation strategy to design a novel in-place page size selection mechanism that avoids data migration. This mechanism allows the TLB to use large pages, reducing address translation overhead. During data transfer, this mechanism enables the GPU to transfer only the base pages that are needed by the application over the system I/O bus, keeping demand paging overhead low. Our evaluations show that Mosaic reduces address translation overheads while efficiently achieving the benefits of demand paging, compared to a contemporary GPU that uses only a 4KB page size. Relative to a state-of-the-art GPU memory manager, Mosaic improves the performance of homogeneous and heterogeneous multi-application workloads by 55.5\% and 29.7\% on average, respectively, coming within 6.8\% and 15.4\% of the performance of an ideal TLB where all TLB requests are hits.},
	author = {Rachata Ausavarungnirun and Joshua Landgraf and Vance Miller and Saugata Ghose and Jayneel Gandhi and Christopher Rossbach and others},
	booktitle = {Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture},
	title = {Mosaic: a GPU memory manager with application-transparent support for multiple page sizes},
	venue = {Cambridge, MA, USA},
	year = {2017}
}
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, MA, USA, October 2017
DRAM cells in close proximity can fail depending on the data content in neighboring cells. These failures are called data-dependent failures. Detecting and mitigating these failures online, while the system is running in the field, enables various optimizations that improve reliability, latency, and energy efficiency of the system. For example, a system can improve performance and energy efficiency by using a lower refresh rate for most cells and mitigate the failing cells using higher refresh rates or error correcting codes. All these system optimizations depend on accurately detecting every possible data-dependent failure that could occur with any content in DRAM. Unfortunately, detecting all data-dependent failures requires the knowledge of DRAM internals specific to each DRAM chip. As internal DRAM architecture is not exposed to the system, detecting data-dependent failures at the system-level is a major challenge. In this paper, we decouple the detection and mitigation of data-dependent failures from physical DRAM organization such that it is possible to detect failures without knowledge of DRAM internals. To this end, we propose MEMCON, a memory content-based detection and mitigation mechanism for data-dependent failures in DRAM. MEMCON does not detect every possible data-dependent failure. Instead, it detects and mitigates failures that occur only with the current content in memory while the programs are running in the system. Such a mechanism needs to detect failures whenever there is a write access that changes the content of memory. As detection of failure with a runtime testing has a high overhead, MEMCON selectively initiates a test on a write, only when the time between two consecutive writes to that page (i.e., write interval) is long enough to provide significant benefit by lowering the refresh rate during that interval. 
MEMCON builds upon a simple, practical mechanism that predicts the long write intervals based on our observation that the write intervals in real workloads follow a Pareto distribution: the longer a page remains idle after a write, the longer it is expected to remain idle. Our evaluation shows that compared to a system that uses an aggressive refresh rate, MEMCON reduces refresh operations by 65--74%, leading to a 10%/17%/40% (min) to 12%/22%/50% (max) performance improvement for a single-core and 10%/23%/52% (min) to 17%/29%/65% (max) performance improvement for a 4-core system using 8/16/32 Gb DRAM chips.
@inproceedings{abc,
	abstract = {DRAM cells in close proximity can fail depending on the data content in neighboring cells. These failures are called data-dependent failures. Detecting and mitigating these failures online, while the system is running in the field, enables various optimizations that improve reliability, latency, and energy efficiency of the system. For example, a system can improve performance and energy efficiency by using a lower refresh rate for most cells and mitigate the failing cells using higher refresh rates or error correcting codes. All these system optimizations depend on accurately detecting every possible data-dependent failure that could occur with any content in DRAM. Unfortunately, detecting all data-dependent failures requires the knowledge of DRAM internals specific to each DRAM chip. As internal DRAM architecture is not exposed to the system, detecting data-dependent failures at the system-level is a major challenge.

In this paper, we decouple the detection and mitigation of data-dependent failures from physical DRAM organization such that it is possible to detect failures without knowledge of DRAM internals. To this end, we propose MEMCON, a memory content-based detection and mitigation mechanism for data-dependent failures in DRAM. MEMCON does not detect every possible data-dependent failure. Instead, it detects and mitigates failures that occur only with the current content in memory while the programs are running in the system. Such a mechanism needs to detect failures whenever there is a write access that changes the content of memory. As detection of failure with a runtime testing has a high overhead, MEMCON selectively initiates a test on a write, only when the time between two consecutive writes to that page (i.e., write interval) is long enough to provide significant benefit by lowering the refresh rate during that interval. MEMCON builds upon a simple, practical mechanism that predicts the long write intervals based on our observation that the write intervals in real workloads follow a Pareto distribution: the longer a page remains idle after a write, the longer it is expected to remain idle.

Our evaluation shows that compared to a system that uses an aggressive refresh rate, MEMCON reduces refresh operations by 65--74\%, leading to a 10\%/17\%/40\% (min) to 12\%/22\%/50\% (max) performance improvement for a single-core and 10\%/23\%/52\% (min) to 17\%/29\%/65\% (max) performance improvement for a 4-core system using 8/16/32 Gb DRAM chips.},
	author = {Samira Manabi Khan and Chris Wilkerson and Zhe Wang and Alaa R. Alameldeen and Donghyuk Lee and Onur Mutlu},
	booktitle = {Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture},
	title = {Detecting and mitigating data-dependent DRAM failures by exploiting current memory content},
	venue = {Cambridge, MA, USA},
	year = {2017}
}
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, MA, USA, October 2017
Many important applications trigger bulk bitwise operations, i.e., bitwise operations on large bit vectors. In fact, recent works design techniques that exploit fast bulk bitwise operations to accelerate databases (bitmap indices, BitWeaving) and web search (BitFunnel). Unfortunately, in existing architectures, the throughput of bulk bitwise operations is limited by the memory bandwidth available to the processing unit (e.g., CPU, GPU, FPGA, processing-in-memory). To overcome this bottleneck, we propose Ambit, an Accelerator-in-Memory for bulk bitwise operations. Unlike prior works, Ambit exploits the analog operation of DRAM technology to perform bitwise operations completely inside DRAM, thereby exploiting the full internal DRAM bandwidth. Ambit consists of two components. First, simultaneous activation of three DRAM rows that share the same set of sense amplifiers enables the system to perform bitwise AND and OR operations. Second, with modest changes to the sense amplifier, the system can use the inverters present inside the sense amplifier to perform bitwise NOT operations. With these two components, Ambit can perform any bulk bitwise operation efficiently inside DRAM. Ambit largely exploits existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1% of DRAM chip area). Importantly, Ambit uses the modern DRAM interface without any changes, and therefore it can be directly plugged onto the memory bus. Our extensive circuit simulations show that Ambit works as expected even in the presence of significant process variation. Averaged across seven bulk bitwise operations, Ambit improves performance by 32X and reduces energy consumption by 35X compared to state-of-the-art systems. When integrated with Hybrid Memory Cube (HMC), a 3D-stacked DRAM with a logic layer, Ambit improves performance of bulk bitwise operations by 9.7X compared to processing in the logic layer of the HMC. 
Ambit improves the performance of three real-world data-intensive applications, 1) database bitmap indices, 2) BitWeaving, a technique to accelerate database scans, and 3) bit-vector-based implementation of sets, by 3X-7X compared to a state-of-the-art baseline using SIMD optimizations. We describe four other applications that can benefit from Ambit, including a recent technique proposed to speed up web search. We believe that large performance and energy improvements provided by Ambit can enable other applications to use bulk bitwise operations.
@inproceedings{abc,
	abstract = {Many important applications trigger bulk bitwise operations, i.e., bitwise operations on large bit vectors. In fact, recent works design techniques that exploit fast bulk bitwise operations to accelerate databases (bitmap indices, BitWeaving) and web search (BitFunnel). Unfortunately, in existing architectures, the throughput of bulk bitwise operations is limited by the memory bandwidth available to the processing unit (e.g., CPU, GPU, FPGA, processing-in-memory).

To overcome this bottleneck, we propose Ambit, an Accelerator-in-Memory for bulk bitwise operations. Unlike prior works, Ambit exploits the analog operation of DRAM technology to perform bitwise operations completely inside DRAM, thereby exploiting the full internal DRAM bandwidth. Ambit consists of two components. First, simultaneous activation of three DRAM rows that share the same set of sense amplifiers enables the system to perform bitwise AND and OR operations. Second, with modest changes to the sense amplifier, the system can use the inverters present inside the sense amplifier to perform bitwise NOT operations. With these two components, Ambit can perform any bulk bitwise operation efficiently inside DRAM. Ambit largely exploits existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1\% of DRAM chip area). Importantly, Ambit uses the modern DRAM interface without any changes, and therefore it can be directly plugged onto the memory bus.

Our extensive circuit simulations show that Ambit works as expected even in the presence of significant process variation. Averaged across seven bulk bitwise operations, Ambit improves performance by 32X and reduces energy consumption by 35X compared to state-of-the-art systems. When integrated with Hybrid Memory Cube (HMC), a 3D-stacked DRAM with a logic layer, Ambit improves performance of bulk bitwise operations by 9.7X compared to processing in the logic layer of the HMC. Ambit improves the performance of three real-world data-intensive applications, 1) database bitmap indices, 2) BitWeaving, a technique to accelerate database scans, and 3) bit-vector-based implementation of sets, by 3X-7X compared to a state-of-the-art baseline using SIMD optimizations. We describe four other applications that can benefit from Ambit, including a recent technique proposed to speed up web search. We believe that large performance and energy improvements provided by Ambit can enable other applications to use bulk bitwise operations.},
	author = {Vivek Seshadri and Donghyuk Lee and Thomas Mullins and Hasan Hassan and Amirali Boroumand and Jeremie Kim and Michael A. Kozuch and Onur Mutlu and Phillip B. Gibbons and Todd C. Mowry},
	booktitle = {Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture},
	title = {Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology},
	venue = {Cambridge, MA, USA},
	year = {2017}
}
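A rough illustration of the AND/OR mechanism summarized in the abstract above: in the Ambit design, activating three rows at once leaves each bit position holding the majority value of the three cells, so fixing the third row to all zeros or all ones turns the majority into AND or OR. The Python sketch below only models this on plain integers. The function names and the 8-bit row width are assumptions chosen for illustration and are not part of Ambit.

```python
ROW_BITS = 8  # assumed row width for this sketch; real DRAM rows are far wider
ALL_ONES = (1 << ROW_BITS) - 1


def majority(a, b, c):
    """Bitwise majority of three rows: a bit is 1 if at least two inputs are 1."""
    return (a & b) | (b & c) | (a & c)


def ambit_and(a, b):
    # Control row of all zeros: majority(a, b, 0) reduces to a AND b.
    return majority(a, b, 0)


def ambit_or(a, b):
    # Control row of all ones: majority(a, b, 1...1) reduces to a OR b.
    return majority(a, b, ALL_ONES)


def ambit_not(a):
    # Stands in for the inverter path of the sense amplifier.
    return ~a & ALL_ONES
```

AND, OR, and NOT together are functionally complete, which is consistent with the abstract's claim that any bulk bitwise operation can be composed from these two components.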
Proceedings of the 2017 IEEE International Conference on Cluster Computing (CLUSTER), Honolulu, HI, USA, September 2017
While the memory footprints of cloud and HPC applications continue to increase, fundamental issues with DRAM scaling are likely to prevent traditional main memory systems, composed of monolithic DRAM, from greatly growing in capacity. Hybrid memory systems can mitigate the scaling limitations of monolithic DRAM by pairing together multiple memory technologies (e.g., different types of DRAM, or DRAM and non-volatile memory) at the same level of the memory hierarchy. The goal of a hybrid main memory is to combine the different advantages of the multiple memory types in a cost-effective manner while avoiding the disadvantages of each technology. Memory pages are placed in and migrated between the different memories within a hybrid memory system, based on the properties of each page. It is important to make intelligent page management (i.e., placement and migration) decisions, as they can significantly affect system performance. In this paper, we propose utility-based hybrid memory management (UH-MEM), a new page management mechanism for various hybrid memories, that systematically estimates the utility (i.e., the system performance benefit) of migrating a page between different memory types, and uses this information to guide data placement. UH-MEM operates in two steps. First, it estimates how much a single application would benefit from migrating one of its pages to a different type of memory, by comprehensively considering access frequency, row buffer locality, and memory-level parallelism. Second, it translates the estimated benefit of a single application to an estimate of the overall system performance benefit from such a migration. We evaluate the effectiveness of UH-MEM with various types of hybrid memories, and show that it significantly improves system performance on each of these hybrid memories.
For a memory system with DRAM and non-volatile memory, UH-MEM improves performance by 14% on average (and up to 26%) compared to the best of three evaluated state-of-the-art mechanisms across a large number of data-intensive workloads.
@inproceedings{abc,
	abstract = {While the memory footprints of cloud and HPC applications continue to increase, fundamental issues with DRAM scaling are likely to prevent traditional main memory systems, composed of monolithic DRAM, from greatly growing in capacity. Hybrid memory systems can mitigate the scaling limitations of monolithic DRAM by pairing together multiple memory technologies (e.g., different types of DRAM, or DRAM and non-volatile memory) at the same level of the memory hierarchy. The goal of a hybrid main memory is to combine the different advantages of the multiple memory types in a cost-effective manner while avoiding the disadvantages of each technology. Memory pages are placed in and migrated between the different memories within a hybrid memory system, based on the properties of each page. It is important to make intelligent page management (i.e., placement and migration) decisions, as they can significantly affect system performance. In this paper, we propose utility-based hybrid memory management (UH-MEM), a new page management mechanism for various hybrid memories, that systematically estimates the utility (i.e., the system performance benefit) of migrating a page between different memory types, and uses this information to guide data placement. UH-MEM operates in two steps. First, it estimates how much a single application would benefit from migrating one of its pages to a different type of memory, by comprehensively considering access frequency, row buffer locality, and memory-level parallelism. Second, it translates the estimated benefit of a single application to an estimate of the overall system performance benefit from such a migration. We evaluate the effectiveness of UH-MEM with various types of hybrid memories, and show that it significantly improves system performance on each of these hybrid memories.
For a memory system with DRAM and non-volatile memory, UH-MEM improves performance by 14\% on average (and up to 26\%) compared to the best of three evaluated state-of-the-art mechanisms across a large number of data-intensive workloads.},
	author = {Yang Li and Saugata Ghose and Jongmoo Choi and Jin Sun and Hui Wang and Onur Mutlu},
	booktitle = {Proceedings of the 2017 IEEE International Conference on Cluster Computing (CLUSTER)},
	title = {Utility-Based Hybrid Memory Management},
	venue = {Honolulu, HI, USA},
	year = {2017}
}
Proceedings of the 25th Annual Symposium on High-Performance Interconnects (HOTI'17), Santa Clara, CA, USA, August 2017
Interconnection networks must meet the communication demands of current High-Performance Computing systems. In order to interconnect efficiently the end nodes of these systems with a good performance-to-cost ratio, new network topologies have been proposed in the last years that leverage high-radix switches, such as Slim Fly. Adversarial traffic patterns, however, may reduce severely the performance of Slim Fly networks when using only minimal-path routing. In order to mitigate the performance degradation in these scenarios, Slim Fly networks should configure an oblivious or adaptive non-minimal routing. The non-minimal routing algorithms proposed for Slim Fly usually rely on Valiant's algorithm to select the paths, at the cost of doubling the average path-length, as well as the number of Virtual Channels (VCs) required to prevent deadlocks. Moreover, Valiant may introduce additional inefficiencies when applied to Slim Fly networks, such as the "turn-around problem" that we analyze in this work. With the aim of overcoming these drawbacks, we propose in this paper two variants of Valiant's algorithm that improve the non-minimal path selection in Slim Fly networks. They are designed to be combined with adaptive routing algorithms that rely on Valiant to select non-minimal paths, such as UGAL or PAR, which we have adapted to the Slim Fly topology. Through the results from simulation experiments, we show that our proposals improve the network performance and/or reduce the number of required VCs to prevent deadlocks, even in scenarios with adversarial traffic.
@inproceedings{abc,
	abstract = {Interconnection networks must meet the communication demands of current High-Performance Computing systems. In order to interconnect efficiently the end nodes of these systems with a good performance-to-cost ratio, new network topologies have been proposed in the last years that leverage high-radix switches, such as Slim Fly. Adversarial traffic patterns, however, may reduce severely the performance of Slim Fly networks when using only minimal-path routing. In order to mitigate the performance degradation in these scenarios, Slim Fly networks should configure an oblivious or adaptive non-minimal routing. The non-minimal routing algorithms proposed for Slim Fly usually rely on Valiant{\textquoteright}s algorithm to select the paths, at the cost of doubling the average path-length, as well as the number of Virtual Channels (VCs) required to prevent deadlocks. Moreover, Valiant may introduce additional inefficiencies when applied to Slim Fly networks, such as the "turn-around problem" that we analyze in this work. With the aim of overcoming these drawbacks, we propose in this paper two variants of Valiant{\textquoteright}s algorithm that improve the non-minimal path selection in Slim Fly networks. They are designed to be combined with adaptive routing algorithms that rely on Valiant to select non-minimal paths, such as UGAL or PAR, which we have adapted to the Slim Fly topology. Through the results from simulation experiments, we show that our proposals improve the network performance and/or reduce the number of required VCs to prevent deadlocks, even in scenarios with adversarial traffic.},
	author = {Pedro Yebenes and Jesus Escudero-Sahuquillo and Pedro Javier Garcia and Francisco J. Quiles and Torsten Hoefler},
	booktitle = {Proceedings of the 25th Annual Symposium on High-Performance Interconnects (HOTI{\textquoteright}17)},
	title = {Improving Non-Minimal and Adaptive Routing Algorithms in Slim Fly Networks},
	venue = {Santa Clara, CA, USA},
	year = {2017}
}
Proceedings of the 25th Annual Symposium on High-Performance Interconnects (HOTI'17), Santa Clara, CA, USA, August 2017
The advent of non-volatile memory (NVM) technologies has added an interesting nuance to the node level memory hierarchy. With modern 100 Gb/s networks, the NVM tier of storage can often be slower than the high performance network in the system; thus, a new challenge arises in the datacenter. Whereas prior efforts have studied the impacts of multiple sources targeting one node (i.e., incast) and have studied multiple flows causing congestion in inter-switch links, it is now possible for a single flow from a single source to overwhelm the bandwidth of a key portion of the memory hierarchy. This can subsequently spread to the switches and lead to congestion trees in a flow-controlled network or excessive packet drops without flow control. In this work we describe protocols which avoid overwhelming the receiver in the case of a source/sink rate mismatch. We design our protocols on top of Portals 4, which enables us to make use of network offload. Our protocol yields up to 4× higher throughput in a 5k node Dragonfly topology for a permutation traffic pattern in which only 1% of all nodes have a memory write-bandwidth limitation of 1/8th of the network bandwidth.
@inproceedings{abc,
	abstract = {The advent of non-volatile memory (NVM) technologies has added an interesting nuance to the node level memory hierarchy. With modern 100 Gb/s networks, the NVM tier of storage can often be slower than the high performance network in the system; thus, a new challenge arises in the datacenter. Whereas prior efforts have studied the impacts of multiple sources targeting one node (i.e., incast) and have studied multiple flows causing congestion in inter-switch links, it is now possible for a single flow from a single source to overwhelm the bandwidth of a key portion of the memory hierarchy. This can subsequently spread to the switches and lead to congestion trees in a flow-controlled network or excessive packet drops without flow control. In this work we describe protocols which avoid overwhelming the receiver in the case of a source/sink rate mismatch. We design our protocols on top of Portals 4, which enables us to make use of network offload. Our protocol yields up to 4{\texttimes} higher throughput in a 5k node Dragonfly topology for a permutation traffic pattern in which only 1\% of all nodes have a memory write-bandwidth limitation of 1/8th of the network bandwidth.},
	author = {Timo Schneider and James Dinan and Mario Flajslik and Keith D. Underwood and Torsten Hoefler},
	booktitle = {Proceedings of the 25th Annual Symposium on High-Performance Interconnects (HOTI{\textquoteright}17)},
	title = {Fast Networks and Slow Memories: A Mechanism for Mitigating Bandwidth Mismatches},
	venue = {Santa Clara, CA, USA},
	year = {2017}
}
Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, August 2017
@inproceedings{abc,
	author = {Hantian Zhang and Jerry Li and Kaan Kara and Dan Alistarh and Ji Liu and Ce Zhang},
	booktitle = {Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia},
	title = {ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning},
	url = {http://proceedings.mlr.press/v70/zhang17e.html},
	year = {2017}
}
Proceedings of the VLDB Endowment, Munich, Germany, August 2017
The ever increasing amount of data being handled in data centers causes an intrinsic inefficiency: moving data around is expensive in terms of bandwidth, latency, and power consumption, especially given the low computational complexity of many database operations. In this paper we explore near-data processing in database engines, i.e., the option of offloading part of the computation directly to the storage nodes. We implement our ideas in Caribou, an intelligent distributed storage layer incorporating many of the lessons learned while building systems with specialized hardware. Caribou provides access to DRAM/NVRAM storage over the network through a simple key-value store interface, with each storage node providing high-bandwidth near-data processing at line rate and fault tolerance through replication. The result is a highly efficient, distributed, intelligent data storage that can be used to both boost performance and reduce power consumption and real estate usage in the data center thanks to the micro-server architecture adopted.
@inproceedings{abc,
	abstract = {The ever increasing amount of data being handled in data centers causes an intrinsic inefficiency: moving data around is expensive in terms of bandwidth, latency, and power consumption, especially given the low computational complexity of many database operations. In this paper we explore near-data processing in database engines, i.e., the option of offloading part of the computation directly to the storage nodes. We implement our ideas in Caribou, an intelligent distributed storage layer incorporating many of the lessons learned while building systems with specialized hardware. Caribou provides access to DRAM/NVRAM storage over the network through a simple key-value store interface, with each storage node providing high-bandwidth near-data processing at line rate and fault tolerance through replication. The result is a highly efficient, distributed, intelligent data storage that can be used to both boost performance and reduce power consumption and real estate usage in the data center thanks to the micro-server architecture adopted.},
	author = {Zsolt Istv{\'a}n and David Sidler and Gustavo Alonso},
	booktitle = {Proceedings of the VLDB Endowment},
	title = {Caribou: Intelligent Distributed Storage},
	venue = {Munich, Germany},
	year = {2017}
}
9th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud 2017, Santa Clara, CA, USA, July 2017
@inproceedings{abc,
	author = {Debopam Bhattacherjee and Muhammad Tirmazi and Ankit Singla},
	booktitle = {9th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud 2017, Santa Clara, CA, USA},
	title = {A Cloud-based Content Gathering Network},
	url = {https://www.usenix.org/conference/hotcloud17/program/presentation/bhattacherjee},
	year = {2017}
}
Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2017, Washington DC, USA, July 2017
@inproceedings{abc,
	author = {Zhiyu Liu and Irina Calciu and Maurice Herlihy and Onur Mutlu},
	booktitle = {Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2017, Washington DC, USA},
	title = {Concurrent Data Structures for Near-Memory Computing},
	url = {http://doi.acm.org/10.1145/3087556.3087582},
	year = {2017}
}
Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, Washington, DC, USA, June 2017
Many large-scale scientific computations require eigenvalue solvers in a scaling regime where efficiency is limited by data movement. We introduce a parallel algorithm for computing the eigenvalues of a dense symmetric matrix, which performs asymptotically less communication than previously known approaches. We provide analysis in the Bulk Synchronous Parallel (BSP) model with additional consideration for communication between a local memory and cache. Given sufficient memory to store $c$ copies of the symmetric matrix, our algorithm requires $\Theta(\sqrt{c})$ less interprocessor communication than previously known algorithms, for any $c \leq p^{1/3}$ when using $p$ processors. The algorithm first reduces the dense symmetric matrix to a banded matrix with the same eigenvalues. Subsequently, the algorithm employs successive reduction to $O(\log p)$ thinner banded matrices. We employ two new parallel algorithms that achieve lower communication costs for the full-to-band and band-to-band reductions. Both of these algorithms leverage a novel QR factorization algorithm for rectangular matrices.
@inproceedings{abc,
	abstract = {Many large-scale scientific computations require eigenvalue solvers in a scaling regime where efficiency is limited by data movement. We introduce a parallel algorithm for computing the eigenvalues of a dense symmetric matrix, which performs asymptotically less communication than previously known approaches. We provide analysis in the Bulk Synchronous Parallel (BSP) model with additional consideration for communication between a local memory and cache. Given sufficient memory to store $c$ copies of the symmetric matrix, our algorithm requires $\Theta(\sqrt{c})$ less interprocessor communication than previously known algorithms, for any $c \leq p^{1/3}$ when using $p$ processors. The algorithm first reduces the dense symmetric matrix to a banded matrix with the same eigenvalues. Subsequently, the algorithm employs successive reduction to $O(\log p)$ thinner banded matrices. We employ two new parallel algorithms that achieve lower communication costs for the full-to-band and band-to-band reductions. Both of these algorithms leverage a novel QR factorization algorithm for rectangular matrices.},
	author = {Edgar Solomonik and Grey Ballard and James Demmel and Torsten Hoefler},
	booktitle = {Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures},
	title = {A Communication-Avoiding Parallel Algorithm for the Symmetric Eigenvalue Problem},
	venue = {Washington, DC, USA},
	year = {2017}
}
Proceedings of the 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS'17), Orlando, FL, USA, June 2017
Vectorization and GPUs will profoundly change graph processing. Traditional graph algorithms tuned for 32- or 64-bit based memory accesses will be inefficient on architectures with 512-bit wide (or larger) instruction units that are already present in the Intel Knights Landing (KNL) manycore CPU. Anticipating this shift, we propose SlimSell: a vectorizable graph representation to accelerate Breadth-First Search (BFS) based on sparse-matrix dense-vector (SpMV) products. SlimSell extends and combines the state-of-the-art SIMD-friendly Sell-C-σ matrix storage format with tropical, real, boolean, and sel-max semiring operations. The resulting design reduces the necessary storage (by up to 50%) and thus pressure on the memory subsystem. We augment SlimSell with the SlimWork and SlimChunk schemes that reduce the amount of work and improve load balance, further accelerating BFS. We evaluate all the schemes on Intel Haswell multicore CPUs, the state-of-the-art Intel Xeon Phi KNL manycore CPUs, and NVIDIA Tesla GPUs. Our experiments indicate which semiring offers highest speedups for BFS and illustrate that SlimSell accelerates a tuned Graph500 BFS code by up to 33%. This work shows that vectorization can secure high-performance in BFS based on SpMV products; the proposed principles and designs can be extended to other graph algorithms.
@inproceedings{abc,
	abstract = {Vectorization and GPUs will profoundly change graph processing. Traditional graph algorithms tuned for 32- or 64-bit based memory accesses will be inefficient on architectures with 512-bit wide (or larger) instruction units that are already present in the Intel Knights Landing (KNL) manycore CPU. Anticipating this shift, we propose SlimSell: a vectorizable graph representation to accelerate Breadth-First Search (BFS) based on sparse-matrix dense-vector (SpMV) products. SlimSell extends and combines the state-of-the-art SIMD-friendly Sell-C-σ matrix storage format with tropical, real, boolean, and sel-max semiring operations. The resulting design reduces the necessary storage (by up to 50\%) and thus pressure on the memory subsystem. We augment SlimSell with the SlimWork and SlimChunk schemes that reduce the amount of work and improve load balance, further accelerating BFS. We evaluate all the schemes on Intel Haswell multicore CPUs, the state-of-the-art Intel Xeon Phi KNL manycore CPUs, and NVIDIA Tesla GPUs. Our experiments indicate which semiring offers highest speedups for BFS and illustrate that SlimSell accelerates a tuned Graph500 BFS code by up to 33\%. This work shows that vectorization can secure high-performance in BFS based on SpMV products; the proposed principles and designs can be extended to other graph algorithms.},
	author = {Maciej Besta and Florian Marending and Edgar Solomonik and others},
	booktitle = {Proceedings of the 31st IEEE International Parallel \& Distributed Processing Symposium (IPDPS{\textquoteright}17)},
	title = {SlimSell: A Vectorized Graph Representation for Breadth-First Search},
	venue = {Orlando, FL, USA},
	year = {2017}
}
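To make the idea of BFS as repeated sparse-matrix dense-vector products concrete, the sketch below runs BFS over the boolean semiring, where OR plays the role of addition and AND the role of multiplication. It is a minimal illustration only: it uses a dense Python list-of-lists adjacency matrix rather than the Sell-C-σ format, performs no vectorization, and the function name `bfs_boolean_semiring` is a hypothetical one chosen here.

```python
def bfs_boolean_semiring(adj, source):
    """Return BFS distances from source; unreachable vertices get -1.

    adj[i][j] is True when there is an edge from vertex j to vertex i.
    """
    n = len(adj)
    dist = [-1] * n
    dist[source] = 0
    frontier = [i == source for i in range(n)]
    level = 0
    while any(frontier):
        level += 1
        # One product y = A x over the boolean semiring (OR as +, AND as *).
        reached = [any(adj[i][j] and frontier[j] for j in range(n))
                   for i in range(n)]
        # The next frontier keeps only vertices that were not visited before.
        frontier = [reached[i] and dist[i] == -1 for i in range(n)]
        for i in range(n):
            if frontier[i]:
                dist[i] = level
    return dist
```

Swapping the OR/AND pair for another semiring (for example min and plus in the tropical case) changes what each product computes while the outer loop stays the same, which is the design space the abstract refers to.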
Proceedings of the International Conference on Supercomputing, ICS 2017, Chicago, IL, USA, June 2017
@inproceedings{abc,
	author = {Xi-Yue Xiang and Wentao Shi and Saugata Ghose and Lu Peng and Onur Mutlu and Nian-Feng Tzeng},
	booktitle = {Proceedings of the International Conference on Supercomputing, ICS 2017, Chicago, IL, USA},
	title = {Carpool: a bufferless on-chip network supporting adaptive multicast and hotspot alleviation},
	url = {http://doi.acm.org/10.1145/3079079.3079090},
	year = {2017}
}
Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, Toronto, ON, Canada, June 2017
@inproceedings{abc,
	author = {Minesh Patel and Jeremie Kim and Onur Mutlu},
	booktitle = {Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, Toronto, ON, Canada},
	title = {The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions},
	url = {http://doi.acm.org/10.1145/3079856.3080242},
	year = {2017}
}
Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, Washington, DC, USA, June 2017
Many distributed systems require coordination between the components involved. With the steady growth of such systems, the probability of failures increases, which necessitates scalable fault-tolerant agreement protocols. The most common practical agreement protocol, for such scenarios, is leader-based atomic broadcast. In this work, we propose AllConcur, a distributed system that provides agreement through a leaderless concurrent atomic broadcast algorithm, thus, not suffering from the bottleneck of a central coordinator. In AllConcur, all components exchange messages concurrently through a logical overlay network that employs early termination to minimize the agreement latency. Our implementation of AllConcur supports standard sockets-based TCP as well as high-performance InfiniBand Verbs communications. AllConcur can handle up to 135 million requests per second and achieves 17x higher throughput than today's standard leader-based protocols, such as Libpaxos. Thus, AllConcur is highly competitive with regard to existing solutions and, due to its decentralized approach, enables hitherto unattainable system designs in a variety of fields.
@inproceedings{abc,
	abstract = {Many distributed systems require coordination between the components involved. With the steady growth of such systems, the probability of failures increases, which necessitates scalable fault-tolerant agreement protocols. The most common practical agreement protocol, for such scenarios, is leader-based atomic broadcast. In this work, we propose AllConcur, a distributed system that provides agreement through a leaderless concurrent atomic broadcast algorithm, thus, not suffering from the bottleneck of a central coordinator. In AllConcur, all components exchange messages concurrently through a logical overlay network that employs early termination to minimize the agreement latency. Our implementation of AllConcur supports standard sockets-based TCP as well as high-performance InfiniBand Verbs communications. AllConcur can handle up to 135 million requests per second and achieves 17x higher throughput than today{\textquoteright}s standard leader-based protocols, such as Libpaxos. Thus, AllConcur is highly competitive with regard to existing solutions and, due to its decentralized approach, enables hitherto unattainable system designs in a variety of fields.},
	author = {Marius Poke and Torsten Hoefler and Colin W. Glass},
	booktitle = {Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing},
	title = {AllConcur: Leaderless Concurrent Atomic Broadcast},
	venue = {Washington, DC, USA},
	year = {2017}
}
Systems Group Master's Thesis, no. 169; Department of Computer Science, June 2017
Supervised by: Prof. Torsten Hoefler
The polyhedral model makes a large set of transformations readily available for loop nest optimizations. Like traditional program optimization it is not always clear which of those transformations are profitable and under what circumstances. In traditional compilers this is dealt with through the use of hand-crafted heuristics that are often platform specific and not universally “good”. In polyhedral compilation the problem of finding the appropriate transformations is formulated as an Integer Linear Program (ILP). However, not all relevant software and hardware aspects are or can be modelled in such an ILP. Consequently, transformations obtained by solving such ILPs are not necessarily optimal. In this work, we propose the use of data-driven heuristics in polyhedral optimization. To this end, we present a hybrid approach that uses an ILP solver in cooperation with a trained machine learning model that functions as a loop fusion heuristic. The ILP scheduler optimizes individual loops, our heuristic decides which ones should be fused, and then the ILP solver further optimizes the combined loops. The heuristic discovers by itself how loop fusion interacts with the ILP scheduler, the rest of the compiler, and the hardware without needing a compiler or hardware “expert” to teach it.
@mastersthesis{abc,
	abstract = {The polyhedral model makes a large set of transformations readily available for loop nest optimizations. Like traditional program optimization it is not always clear which of those transformations are profitable and under what circumstances. In traditional compilers this is dealt with through the use of hand-crafted heuristics that are often platform specific and not universally {\textquotedblleft}good{\textquotedblright}. In polyhedral compilation the problem of finding the appropriate transformations is formulated as an Integer Linear Program (ILP). However, not all relevant software and hardware aspects are or can be modelled in such an ILP. Consequently, transformations obtained by solving such ILPs are not necessarily optimal.
In this work, we propose the use of data-driven heuristics in polyhedral optimization. To this end, we present a hybrid approach that uses an ILP solver in cooperation with a trained machine learning model that functions as a loop fusion heuristic. The ILP scheduler optimizes individual loops, our heuristic decides which ones should be fused, and then the ILP solver further optimizes the combined loops.
The heuristic discovers by itself how loop fusion interacts with the ILP scheduler, the rest of the compiler, and the hardware without needing a compiler or hardware {\textquotedblleft}expert{\textquotedblright} to teach it.},
	author = {Theodoros Theodoridis},
	note = {Systems Group Master's Thesis, no. 169},
	school = {Department of Computer Science},
	title = {Fused {\textquotedblleft}Learning{\textquotedblright} and {\textquotedblleft}Linear Programming{\textquotedblright}: a Novel Loop Fusion Approach},
	year = {2017}
}
Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, Urbana-Champaign, IL, USA, June 2017
@inproceedings{abc,
	author = {Kevin K. Chang and Abdullah Giray Yaglik{\c c}i and Saugata Ghose and Aditya Agrawal and Niladrish Chatterjee and Abhijith Kashyap and Donghyuk Lee and Mike O{\textquoteright}Connor and Hasan Hassan and Onur Mutlu},
	booktitle = {Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, Urbana-Champaign, IL, USA},
	title = {Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms},
	url = {http://doi.acm.org/10.1145/3078505.3078590},
	year = {2017}
}
Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, Washington, DC, USA, June 2017
We reduce the cost of communication and synchronization in graph processing by analyzing the fastest way to process graphs: pushing the updates to a shared state or pulling the updates to a private state. We investigate the applicability of this push-pull dichotomy to various algorithms and its impact on complexity, performance, and the amount of used locks, atomics, and reads/writes. We consider 11 graph algorithms, 3 programming models, 2 graph abstractions, and various families of graphs. The conducted analysis illustrates surprising differences between push and pull variants of different algorithms in performance, speed of convergence, and code complexity; the insights are backed up by performance data from hardware counters. We use these findings to illustrate which variant is faster for each algorithm and to develop generic strategies that enable even higher speedups. Our insights can be used to accelerate graph processing engines or libraries on both massively-parallel shared-memory machines as well as distributed-memory systems.
@inproceedings{abc,
	abstract = {We reduce the cost of communication and synchronization in graph processing by analyzing the fastest way to process graphs: pushing the updates to a shared state or pulling the updates to a private state. We investigate the applicability of this push-pull dichotomy to various algorithms and its impact on complexity, performance, and the amount of used locks, atomics, and reads/writes. We consider 11 graph algorithms, 3 programming models, 2 graph abstractions, and various families of graphs. The conducted analysis illustrates surprising differences between push and pull variants of different algorithms in performance, speed of convergence, and code complexity; the insights are backed up by performance data from hardware counters. We use these findings to illustrate which variant is faster for each algorithm and to develop generic strategies that enable even higher speedups. Our insights can be used to accelerate graph processing engines or libraries on both massively-parallel shared-memory machines as well as distributed-memory systems.},
	author = {Maciej Besta and Michal Podstawski and Linus Groner and Edgar Solomonik and Torsten Hoefler},
	booktitle = {Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing},
	title = {To Push or To Pull: On Reducing Communication and Synchronization in Graph Computations},
	venue = {Washington, DC, USA},
	year = {2017}
}
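The push-pull dichotomy described in the abstract above can be sketched with a single rank-propagation step. The example is a simplified PageRank-style update without damping, picked here purely for illustration and not taken from the paper; `step_push` and `step_pull` are hypothetical names. Both variants compute the same values, but the push variant writes into other vertices' entries (shared state), while the pull variant only writes into the entry of the vertex being processed (private state).

```python
def step_push(out_neighbors, rank):
    """Each vertex pushes an equal share of its rank to its out-neighbors."""
    new = [0.0] * len(rank)
    for u, targets in enumerate(out_neighbors):
        if targets:
            share = rank[u] / len(targets)
            for v in targets:
                # Write to another vertex's entry; a parallel version
                # would need atomics or locks here.
                new[v] += share
    return new


def step_pull(out_neighbors, rank):
    """Each vertex pulls shares from its in-neighbors and writes only to itself."""
    n = len(rank)
    # Build the reverse adjacency once so each vertex can read its in-neighbors.
    in_neighbors = [[] for _ in range(n)]
    for u, targets in enumerate(out_neighbors):
        for v in targets:
            in_neighbors[v].append(u)
    new = [0.0] * n
    for v in range(n):
        # Only reads of remote entries; the single write goes to new[v].
        new[v] = sum(rank[u] / len(out_neighbors[u]) for u in in_neighbors[v])
    return new
```

In this sketch the trade-off is visible in the inner loops: pushing performs remote writes, pulling performs remote reads and needs the reverse adjacency.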
Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, Urbana-Champaign, IL, USA, June 2017
@inproceedings{abc,
	author = {Donghyuk Lee and Samira Manabi Khan and Lavanya Subramanian and Saugata Ghose and Rachata Ausavarungnirun and Gennady Pekhimenko and Vivek Seshadri and Onur Mutlu},
	booktitle = {Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, Urbana-Champaign, IL, USA},
	title = {Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms},
	url = {http://doi.acm.org/10.1145/3078505.3078533},
	year = {2017}
}
Proceedings of the International Conference on Computational Science (ICCS'17), Zurich, Switzerland, June 2017
Designing a partial differential equations solver is a complex task which involves making choices about the solution algorithm and its parameters. Such choices are usually made on the basis of personal preference or numerical experiments, which can introduce significant bias into the selection process. In this work we develop a methodology to drive this selection process towards the optimal choices by modelling the accuracy and the performance of the solution algorithm. We show how this methodology can be successfully applied to the linear advection problem. As a result, the selection can be optimally performed with a much lower investment in the development of high-performance versions of the solvers and without using the target architecture for numerical experiments.
@inproceedings{abc,
	abstract = {Designing a partial differential equations solver is a complex task which involves making choices about the solution algorithm and its parameters. Such choices are usually made on the basis of personal preference or numerical experiments, which can introduce significant bias into the selection process. In this work we develop a methodology to drive this selection process towards the optimal choices by modelling the accuracy and the performance of the solution algorithm. We show how this methodology can be successfully applied to the linear advection problem. As a result, the selection can be optimally performed with a much lower investment in the development of high-performance versions of the solvers and without using the target architecture for numerical experiments.},
	author = {Andrea Arteaga and Oliver Fuhrer and Torsten Hoefler and Thomas Schulthess},
	booktitle = {Proceedings of the International Conference on Computational Science (ICCS{\textquoteright}17)},
	title = {Model-Driven Choice of Numerical Methods for the Solution of the Linear Advection Equation},
	venue = {Zurich, Switzerland},
	year = {2017}
}
Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 2017
@inproceedings{abc,
	author = {Jiawei Jiang and Bin Cui and Ce Zhang and Lele Yu},
	booktitle = {Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA},
	title = {Heterogeneity-aware Distributed Parameter Servers.},
	url = {http://doi.acm.org/10.1145/3035918.3035933},
	year = {2017}
}
Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 2017
Taking advantage of recently released hybrid multicore architectures, such as the Intel Xeon+FPGA machine, where the FPGA has coherent access to the main memory through the QPI bus, we explore the benefits of specializing operators to hardware. We focus on two commonly used SQL operators for strings, LIKE and REGEXP_LIKE, and provide a novel and efficient implementation of these operators in reconfigurable hardware. We integrate the hardware accelerator into MonetDB, a main-memory column store, and demonstrate a significant improvement in response time and throughput. Our Hardware User Defined Function (HUDF) can speed up complex pattern matching by an order of magnitude in comparison to the database running on a 10-core CPU. The insights gained from integrating hardware-based string operators into MonetDB should also be useful for future designs combining hardware specialization and databases.
@inproceedings{abc,
	abstract = {Taking advantage of recently released hybrid multicore architectures, such as the Intel Xeon+FPGA machine, where the FPGA has coherent access to the main memory through the QPI bus, we explore the benefits of specializing operators to hardware. We focus on two commonly used SQL operators for strings, LIKE and REGEXP_LIKE, and provide a novel and efficient implementation of these operators in reconfigurable hardware. We integrate the hardware accelerator into MonetDB, a main-memory column store, and demonstrate a significant improvement in response time and throughput. Our Hardware User Defined Function (HUDF) can speed up complex pattern matching by an order of magnitude in comparison to the database running on a 10-core CPU. The insights gained from integrating hardware-based string operators into MonetDB should also be useful for future designs combining hardware specialization and databases.},
	author = {David Sidler and Zsolt Istv{\'a}n and Muhsen Owaida and Gustavo Alonso},
	booktitle = {Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017},
	title = {Accelerating Pattern Matching Queries in Hybrid CPU-FPGA Architectures.},
	url = {http://doi.acm.org/10.1145/3035918.3035954},
	venue = {Chicago, IL, USA},
	year = {2017}
}
Proceedings of the 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS'17), Orlando, FL, USA, May 2017
The constantly increasing gap between communication and computation performance emphasizes the importance of communication-avoidance techniques. Caching is a well-known concept used to reduce accesses to slow local memories. In this work, we extend the caching idea to MPI-3 Remote Memory Access (RMA) operations. Here, caching can avoid inter-node communications and achieve similar benefits for irregular applications as communication-avoiding algorithms for structured applications. We propose CLaMPI, a caching library layered on top of MPI-3 RMA, to automatically optimize code with minimum user intervention. We demonstrate how cached RMA improves the performance of a Barnes Hut simulation and a Local Clustering Coefficient computation up to a factor of 1.8x and 5x, respectively. Due to the low overheads in the cache miss case and the potential benefits, we expect that our ideas around transparent RMA caching will soon be an integral part of many MPI libraries.
@inproceedings{abc,
	abstract = {The constantly increasing gap between communication and computation performance emphasizes the importance of communication-avoidance techniques. Caching is a well-known concept used to reduce accesses to slow local memories. In this work, we extend the caching idea to MPI-3 Remote Memory Access (RMA) operations. Here, caching can avoid inter-node communications and achieve similar benefits for irregular applications as communication-avoiding algorithms for structured applications. We propose CLaMPI, a caching library layered on top of MPI-3 RMA, to automatically optimize code with minimum user intervention. We demonstrate how cached RMA improves the performance of a Barnes Hut simulation and a Local Clustering Coefficient computation up to a factor of 1.8x and 5x, respectively. Due to the low overheads in the cache miss case and the potential benefits, we expect that our ideas around transparent RMA caching will soon be an integral part of many MPI libraries.},
	author = {Salvatore Di Girolamo and Flavio Vella and Torsten Hoefler},
	booktitle = {Proceedings of the 31st IEEE International Parallel \& Distributed Processing Symposium (IPDPS{\textquoteright}17)},
	title = {Transparent Caching for RMA Systems},
	venue = {Orlando, FL, USA},
	year = {2017}
}
Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 2017
@inproceedings{abc,
	author = {Darko Makreshanski and Jana Giceva and Claude Barthels and Gustavo Alonso},
	booktitle = {Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA},
	title = {BatchDB: Efficient Isolated Execution of Hybrid OLTP+OLAP Workloads for Interactive Applications.},
	url = {http://doi.acm.org/10.1145/3035918.3035959},
	year = {2017}
}
Proceedings of the 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS'17), Orlando, FL, USA, May 2017
Large-scale parallel programming environments and algorithms require efficient group-communication on computing systems with failing nodes. Existing reliable broadcast algorithms either cannot guarantee that all nodes are reached or are very expensive in terms of the number of messages and latency. This paper proposes Corrected-Gossip, a method that combines Monte Carlo style gossiping with a deterministic correction phase, to construct a Las Vegas style reliable broadcast that guarantees reaching all the nodes at low cost. We analyze the performance of this method both analytically and by simulations and show how it reduces the latency and network load compared to existing algorithms. Our method improves the latency by 20% and the network load by 53% compared to the fastest known algorithm on 4,096 nodes. We believe that the principle of corrected-gossip opens an avenue for many other reliable group communication operations.
@inproceedings{abc,
	abstract = {Large-scale parallel programming environments and algorithms require efficient group-communication on computing systems with failing nodes. Existing reliable broadcast algorithms either cannot guarantee that all nodes are reached or are very expensive in terms of the number of messages and latency. This paper proposes Corrected-Gossip, a method that combines Monte Carlo style gossiping with a deterministic correction phase, to construct a Las Vegas style reliable broadcast that guarantees reaching all the nodes at low cost. We analyze the performance of this method both analytically and by simulations and show how it reduces the latency and network load compared to existing algorithms. Our method improves the latency by 20\% and the network load by 53\% compared to the fastest known algorithm on 4,096 nodes. We believe that the principle of corrected-gossip opens an avenue for many other reliable group communication operations.},
	author = {Torsten Hoefler and Amnon Barak and Amnon Shiloh and others},
	booktitle = {Proceedings of the 31st IEEE International Parallel \& Distributed Processing Symposium (IPDPS{\textquoteright}17)},
	title = {Corrected Gossip Algorithms for Fast Reliable Broadcast on Unreliable Systems},
	venue = {Orlando, FL, USA},
	year = {2017}
}
Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 2017
Implementing parallel operators in multi-core machines often involves a data partitioning step that divides the data into cache-size blocks and arranges them so as to allow concurrent threads to process them in parallel. Data partitioning is expensive, in some cases up to 90% of the cost of, e.g., a parallel hash join. In this paper we explore the use of an FPGA to accelerate data partitioning. We do so in the context of new hybrid architectures where the FPGA is located as a co-processor residing on a socket and with coherent access to the same memory as the CPU residing on the other socket. Such an architecture reduces data transfer overheads between the CPU and the FPGA, enabling hybrid operator execution where the partitioning happens on the FPGA and the build and probe phases of a join happen on the CPU. Our experiments demonstrate that FPGA-based partitioning is significantly faster and more robust than CPU-based partitioning. The results open interesting options as FPGAs are gradually integrated more tightly with the CPU.
@inproceedings{abc,
	abstract = {Implementing parallel operators in multi-core machines often involves a data partitioning step that divides the data into cache-size blocks and arranges them so as to allow concurrent threads to process them in parallel. Data partitioning is expensive, in some cases up to 90\% of the cost of, e.g., a parallel hash join. In this paper we explore the use of an FPGA to accelerate data partitioning. We do so in the context of new hybrid architectures where the FPGA is located as a co-processor residing on a socket and with coherent access to the same memory as the CPU residing on the other socket. Such an architecture reduces data transfer overheads between the CPU and the FPGA, enabling hybrid operator execution where the partitioning happens on the FPGA and the build and probe phases of a join happen on the CPU. Our experiments demonstrate that FPGA-based partitioning is significantly faster and more robust than CPU-based partitioning. The results open interesting options as FPGAs are gradually integrated more tightly with the CPU.},
	author = {Kaan Kara and Jana Giceva and Gustavo Alonso},
	booktitle = {Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017},
	title = {FPGA-based Data Partitioning.},
	url = {http://doi.acm.org/10.1145/3035918.3035946},
	venue = {Chicago, IL, USA},
	year = {2017}
}
Proceedings of the 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS'17), Orlando, FL, USA, May 2017
We present a new parallel algorithm for solving triangular systems with multiple right hand sides (TRSM). TRSM is used extensively in numerical linear algebra computations, both to solve triangular linear systems of equations as well as to compute factorizations with triangular matrices, such as Cholesky, LU, and QR. Our algorithm achieves better theoretical scalability than known alternatives, while maintaining numerical stability, via selective use of triangular matrix inversion. We leverage the fact that triangular inversion and matrix multiplication are more parallelizable than the standard TRSM algorithm. By only inverting triangular blocks along the diagonal of the initial matrix, we generalize the usual way of TRSM computation and the full matrix inversion approach. This flexibility leads to an efficient algorithm for any ratio of the number of right hand sides to the triangular matrix dimension. We provide a detailed communication cost analysis for our algorithm as well as for the recursive triangular matrix inversion. This cost analysis makes it possible to determine optimal block sizes and processor grids a priori. Relative to the best known algorithms for TRSM, our approach can require asymptotically fewer messages, while performing optimal amounts of computation and communication in terms of words sent.
@inproceedings{abc,
	abstract = {We present a new parallel algorithm for solving triangular systems with multiple right hand sides (TRSM). TRSM is used extensively in numerical linear algebra computations, both to solve triangular linear systems of equations as well as to compute factorizations with triangular matrices, such as Cholesky, LU, and QR. Our algorithm achieves better theoretical scalability than known alternatives, while maintaining numerical stability, via selective use of triangular matrix inversion. We leverage the fact that triangular inversion and matrix multiplication are more parallelizable than the standard TRSM algorithm. By only inverting triangular blocks along the diagonal of the initial matrix, we generalize the usual way of TRSM computation and the full matrix inversion approach. This flexibility leads to an efficient algorithm for any ratio of the number of right hand sides to the triangular matrix dimension. We provide a detailed communication cost analysis for our algorithm as well as for the recursive triangular matrix inversion. This cost analysis makes it possible to determine optimal block sizes and processor grids a priori. Relative to the best known algorithms for TRSM, our approach can require asymptotically fewer messages, while performing optimal amounts of computation and communication in terms of words sent.},
	author = {Tobias Wicky and Edgar Solomonik and Torsten Hoefler},
	booktitle = {Proceedings of the 31st IEEE International Parallel \& Distributed Processing Symposium (IPDPS{\textquoteright}17)},
	title = {Communication-Avoiding Parallel Algorithms for Solving Triangular Systems of Linear Equations},
	venue = {Orlando, FL, USA},
	year = {2017}
}
Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 2017
@inproceedings{abc,
	author = {David Sidler and Zsolt Istv{\'a}n and Muhsen Owaida and Kaan Kara and Gustavo Alonso},
	booktitle = {Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA},
	title = {doppioDB: A Hardware Accelerated Database.},
	url = {http://doi.acm.org/10.1145/3035918.3058746},
	year = {2017}
}
Proceedings of the 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS'17), Orlando, FL, USA, May 2017
Increasingly complex memory systems and on-chip interconnects are developed to mitigate the data movement bottlenecks in manycore processors. One example of such a complex system is the Xeon Phi KNL CPU with three different types of memory, fifteen memory configuration options, and a complex on-chip mesh network connecting up to 72 cores. Users require a detailed understanding of the performance characteristics of the different options to utilize the system efficiently. Unfortunately, peak performance is rarely achievable and achievable performance is hardly documented. We address this with capability models of the memory subsystem, derived by systematic measurements, to guide users through the complex optimization space. As a case study, we provide an extensive model of all memory configuration options for Xeon Phi KNL. We demonstrate how our capability model can be used to automatically derive new close-to-optimal algorithms for various communication functions, yielding improvements of 5x and 24x over Intel's tuned OpenMP and MPI implementations, respectively. Furthermore, we demonstrate how to use the models to assess how efficiently a bitonic sort application utilizes the memory resources. Interestingly, our capability models predict and explain that the high-bandwidth MCDRAM does not improve the bitonic sort performance over DRAM.
@inproceedings{abc,
	abstract = {Increasingly complex memory systems and on-chip interconnects are developed to mitigate the data movement bottlenecks in manycore processors. One example of such a complex system is the Xeon Phi KNL CPU with three different types of memory, fifteen memory configuration options, and a complex on-chip mesh network connecting up to 72 cores. Users require a detailed understanding of the performance characteristics of the different options to utilize the system efficiently. Unfortunately, peak performance is rarely achievable and achievable performance is hardly documented. We address this with capability models of the memory subsystem, derived by systematic measurements, to guide users through the complex optimization space. As a case study, we provide an extensive model of all memory configuration options for Xeon Phi KNL. We demonstrate how our capability model can be used to automatically derive new close-to-optimal algorithms for various communication functions, yielding improvements of 5x and 24x over Intel{\textquoteright}s tuned OpenMP and MPI implementations, respectively. Furthermore, we demonstrate how to use the models to assess how efficiently a bitonic sort application utilizes the memory resources. Interestingly, our capability models predict and explain that the high-bandwidth MCDRAM does not improve the bitonic sort performance over DRAM.},
	author = {Sabela Ramos and Torsten Hoefler},
	booktitle = {Proceedings of the 31st IEEE International Parallel \& Distributed Processing Symposium (IPDPS{\textquoteright}17)},
	title = {Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL},
	venue = {Orlando, FL, USA},
	year = {2017}
}
Proceedings of the Algorithms and Complexity - 10th International Conference, CIAC 2017, Athens, Greece, May 2017
We investigate the multi-agent pathfinding (MAPF) problem with n agents on graphs with n vertices: Each agent has a unique start and goal vertex, with the objective of moving all agents in parallel movements to their goal s.t. each vertex and each edge may only be used by one agent at a time. We give a combinatorial classification of all graphs where this problem is solvable in general, including cases where the solvability depends on the initial agent placement. Furthermore, we present an algorithm solving the MAPF problem in our setting, requiring O(n^2) rounds, or O(n^3) moves of individual agents. Complementing these results, we show that there are graphs where Ω(n^2) rounds and Ω(n^3) moves are required for any algorithm.
@inproceedings{abc,
	abstract = {We investigate the multi-agent pathfinding (MAPF) problem with n agents on graphs with n vertices: Each agent has a unique start and goal vertex, with the objective of moving all agents in parallel movements to their goal s.t. each vertex and each edge may only be used by one agent at a time. We give a combinatorial classification of all graphs where this problem is solvable in general, including cases where the solvability depends on the initial agent placement.
Furthermore, we present an algorithm solving the MAPF problem in our setting, requiring $O(n^2)$ rounds, or $O(n^3)$ moves of individual agents. Complementing these results, we show that there are graphs where $\Omega(n^2)$ rounds and $\Omega(n^3)$ moves are required for any algorithm.},
	author = {Klaus-Tycho Foerster and Linus Groner and Torsten Hoefler and Michael Koenig and Sascha Schmid and Roger Wattenhofer},
	booktitle = {Proceedings of the Algorithms and Complexity - 10th International Conference, CIAC 2017},
	title = {Multi-agent Pathfinding with n Agents on Graphs with n Vertices: Combinatorial Classification and Tight Algorithmic Bounds},
	venue = {Athens, Greece},
	year = {2017}
}
Systems Group Master's Thesis, no. 168; Department of Computer Science, May 2017
Supervised by: Prof. Torsten Hoefler
Field-programmable gate arrays (FPGAs) are gaining interest in the high performance computing community due to their potential for high performance at low power. Programming FPGAs has traditionally been done by hardware engineers in languages working on the register transfer level. High-level synthesis opens FPGAs up to a wider audience, by facilitating the transformation of imperative code into hardware circuits. This thesis builds kernels from C++ source code, exploiting higher-level language features such as objects and templates to increase expressiveness and productivity. By modeling performance in terms of FPGA resources, a scalable matrix-matrix multiplication kernel is constructed. Performance and resource utilization are verified experimentally on an AlphaData 7V3 board, hosting a Xilinx Virtex-7 FPGA. For single precision floating point data, performance up to 95 GFLOP/s was measured. A single source-file solution is constructed, solving not only matrix-matrix multiplication, but also the all-pairs shortest path problem by substituting operations and data types. The blocked, hybrid CPU-FPGA approach was used to gain further insights into resource utilization for integer data. These results demonstrate that HLS can indeed enable FPGA programming for high performance with little to no prior experience in hardware design.
@mastersthesis{abc,
	abstract = {Field-programmable gate arrays (FPGAs) are gaining interest in the high performance computing community due to their potential for high performance at low power. Programming FPGAs has traditionally been done by hardware engineers in languages working on the register transfer level. High-level synthesis opens FPGAs up to a wider audience, by facilitating the transformation of imperative code into hardware circuits. This thesis builds kernels from C++ source code, exploiting higher-level language features such as objects and templates to increase expressiveness and productivity.
By modeling performance in terms of FPGA resources, a scalable matrix-matrix multiplication kernel is constructed. Performance and resource utilization are verified experimentally on an AlphaData 7V3 board, hosting a Xilinx Virtex-7 FPGA. For single precision floating point data, performance up to 95 GFLOP/s was measured.
A single source-file solution is constructed, solving not only matrix-matrix multiplication, but also the all-pairs shortest path problem by substituting operations and data types. The blocked, hybrid CPU-FPGA approach was used to gain further insights into resource utilization for integer data. These results demonstrate that HLS can indeed enable FPGA programming for high performance with little to no prior experience in hardware design.},
	author = {Roman Cattaneo},
	note = {Systems Group Master{\textquoteright}s Thesis, no. 168},
	school = {Department of Computer Science},
	title = {High-level synthesis of dense matrix operations on FPGA},
	year = {2017}
}
Proceedings of the 16th Workshop on Hot Topics in Operating Systems, Whistler, BC, Canada, May 2017
It is time to reconsider memory protection. The emergence of large non-volatile main memories, scalable interconnects, and rack-scale computers running large numbers of small "micro services" creates significant challenges for memory protection based solely on MMU mechanisms. Central to this is a tension between protection and translation: optimizing for translation performance often comes with a cost in protection flexibility. We argue that a key-based memory protection scheme, complementary to but separate from regular page-level translation, is a better match for this new world. We present MaKC, a new architecture which combines two levels of capability-based protection to scale fine-grained memory protection at both user and kernel level to large numbers of protection domains without compromising efficiency at scale or ease of revocation.
@inproceedings{abc,
	abstract = {It is time to reconsider memory protection. The emergence of large non-volatile main memories, scalable interconnects, and rack-scale computers running large numbers of small "micro services" creates significant challenges for memory protection based solely on MMU mechanisms. Central to this is a tension between protection and translation: optimizing for translation performance often comes with a cost in protection flexibility.

We argue that a key-based memory protection scheme, complementary to but separate from regular page-level translation, is a better match for this new world. We present MaKC, a new architecture which combines two levels of capability-based protection to scale fine-grained memory protection at both user and kernel level to large numbers of protection domains without compromising efficiency at scale or ease of revocation.},
	author = {Reto Achermann and Chris Dalton and Paolo Faraboschi and Moritz Hoffmann and Dejan S. Milojicic and Geoffrey Ndu and Alexander Richardson and Timothy Roscoe and Adrian L. Shaw and Robert N. M. Watson},
	booktitle = {Proceedings of the 16th Workshop on Hot Topics in Operating Systems},
	title = {Separating Translation from Protection in Address Spaces with Dynamic Remapping},
	venue = {Whistler, BC, Canada},
	year = {2017}
}
Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics, HILDA@SIGMOD 2017, Chicago, IL, USA, May 2017
@inproceedings{abc,
	author = {Ce Zhang and Wentao Wu and Tian Li},
	booktitle = {Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics, HILDA@SIGMOD 2017, Chicago, IL, USA},
	title = {An Overreaction to the Broken Machine Learning Abstraction: The ease.ml Vision.},
	url = {http://doi.acm.org/10.1145/3077257.3077265},
	year = {2017}
}
25th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2017, Napa, CA, USA, April 2017
@inproceedings{abc,
	author = {Muhsen Owaida and David Sidler and Kaan Kara and Gustavo Alonso},
	booktitle = {25th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2017, Napa, CA, USA},
	title = {Centaur: A Framework for Hybrid CPU-FPGA Databases.},
	url = {https://doi.org/10.1109/FCCM.2017.37},
	year = {2017}
}
Systems Group Master's Thesis, no. 164; Department of Computer Science, April 2017
Supervised by: Prof. Timothy Roscoe
In today's world, computer networks have grown immensely in terms of size and complexity. In an attempt to manage the dynamics and scale of these networks, software-defined networking was introduced, an approach that aims to supervise the behavior of the network through open interfaces. In order to deal with the challenges that arise in controlling software-defined networks, complicated analysis of the characteristics and structures of these networks has to be done. In particular, the network controller has to be capable of matching user-defined patterns, finding the shortest paths across the network and extracting subgraphs that conform to specific requirements. In this thesis, we present an Interface of the Property Graph Query Language for Differential Dataflow, capable of fulfilling these needs. The Property Graph Query Language enables us to express the three aforementioned tasks in a formal way, whereas Differential Dataflow provides the proper tools to conduct the analysis.
@mastersthesis{abc,
	abstract = {In today{\textquoteright}s world, computer networks have grown immensely in terms of size and complexity. In an attempt to manage the dynamics and scale of these networks, software-defined networking was introduced, an approach that aims to supervise the behavior of the network through open interfaces. In order to deal with the challenges that arise in controlling software-defined networks, complicated analysis of the characteristics and structures of these networks has to be done. In particular, the network controller has to be capable of matching user-defined patterns, finding the shortest paths across the network and extracting subgraphs that conform to specific requirements. In this thesis, we present an Interface of the Property Graph Query Language for Differential Dataflow, capable of fulfilling these needs. The Property Graph Query Language enables us to express the three aforementioned tasks in a formal way, whereas Differential Dataflow provides the proper tools to conduct the analysis.},
	author = {Lukas Striebel},
	note = {Systems Group Master{\textquoteright}s Thesis, no. 164},
	school = {Department of Computer Science},
	title = {A High-level Graph Query Language Interface for Differential Dataflow},
	year = {2017}
}
Proceedings 2nd Workshop on Models for Formal Analysis of Real Systems, MARS@ETAPS 2017, Uppsala, Sweden, April 2017
@article{abc,
	author = {Reto Achermann and Lukas Humbel and David Cock and Timothy Roscoe},
	journal = {Proceedings 2nd Workshop on Models for Formal Analysis of Real Systems, MARS@ETAPS 2017, Uppsala, Sweden},
	title = {Formalizing Memory Accesses and Interrupts.},
	url = {http://dx.doi.org/10.4204/EPTCS.244.4},
	year = {2017}
}
25th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2017, Napa, CA, USA, April 2017
Stochastic gradient descent (SGD) is a commonly used algorithm for training linear machine learning models. Based on vector algebra, it benefits from the inherent parallelism available in an FPGA. In this paper, we first present a single-precision floating-point SGD implementation on an FPGA that provides performance similar to that of a 10-core CPU. We then adapt the design to make it capable of processing low-precision data. The low-precision data is obtained from a novel compression scheme called stochastic quantization, specifically designed for machine learning applications. We test both full-precision and low-precision designs on various regression and classification data sets. We achieve up to an order of magnitude training speedup when using low-precision data compared to a full-precision SGD on the same FPGA and a state-of-the-art multi-core solution, while maintaining the quality of training. We open source the designs presented in this paper.
@inproceedings{abc,
	abstract = {Stochastic gradient descent (SGD) is a commonly used algorithm for training linear machine learning models. Based on vector algebra, it benefits from the inherent parallelism available in an FPGA. In this paper, we first present a single-precision floating-point SGD implementation on an FPGA that provides performance similar to that of a 10-core CPU. We then adapt the design to make it capable of processing low-precision data. The low-precision data is obtained from a novel compression scheme called stochastic quantization, specifically designed for machine learning applications. We test both full-precision and low-precision designs on various regression and classification data sets. We achieve up to an order of magnitude training speedup when using low-precision data compared to a full-precision SGD on the same FPGA and a state-of-the-art multi-core solution, while maintaining the quality of training. We open source the designs presented in this paper.},
	author = {Kaan Kara and Dan Alistarh and Gustavo Alonso and Onur Mutlu and Ce Zhang},
	booktitle = {25th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2017, Napa, CA, USA},
	title = {FPGA-Accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-Off.},
	url = {https://doi.org/10.1109/FCCM.2017.39},
	venue = {Napa, CA, USA},
	year = {2017}
}
Systems Group Master's Thesis, no. 162; Department of Computer Science, April 2017
Supervised by: Prof. Ce Zhang
Nowadays, major cloud providers offer machine learning services (a.k.a. machine learning clouds) to customers. Microsoft Azure Machine Learning Studio and Amazon Machine Learning are two of the most popular machine learning clouds, which raise the level of abstraction for specifying machine learning models and tasks to ease the deployment of real-world machine learning applications. However, real-world machine learning applications are not simple in general, and raising the level of abstraction for machine learning systems rarely comes for free. Thus, an important question arises: What is the performance of machine learning clouds on real-world machine learning problems? To answer this question, we first construct a benchmark data set, MLbench, from the top winning solutions of different competitions on Kaggle. We then present the results obtained by running MLbench on top of these two machine learning clouds, evaluated by Kaggle. Our study reveals the strengths and limitations of these machine learning clouds, and also points out possible directions for improvement.
@mastersthesis{abc,
	abstract = {Nowadays, major cloud providers offer machine learning services (a.k.a. machine
learning clouds) to customers. Microsoft Azure Machine Learning Studio
and Amazon Machine Learning are two of the most popular machine learning
clouds, which raise the level of abstraction for specifying machine learning models
and tasks to ease the deployment of real-world machine learning applications.
However, real-world machine learning applications are not simple in general, and
raising the level of abstraction for machine learning systems rarely comes for
free. Thus, an important question arises: What is the performance of machine
learning clouds on real-world machine learning problems? To answer this
question, we first construct a benchmark data set, MLbench, from the top winning
solutions of different competitions on Kaggle. We then present the results
obtained by running MLbench on top of these two machine learning clouds, evaluated
by Kaggle. Our study reveals the strengths and limitations of these machine
learning clouds, and also points out possible directions for improvement.},
	author = {Luyuan Zeng},
	school = {162},
	title = {Evaluate Machine Learning Clouds with Kaggle},
	year = {2017}
}
Systems Group Master's Thesis, no. 163; Department of Computer Science, April 2017
Supervised by: Prof. Ce Zhang
Training deep learning models has received tremendous research interest recently. In particular, there has been intensive research on reducing the communication cost of the training process when using multiple computational devices, through reducing the precision of the underlying data representation. Naturally, such methods induce system trade-offs: lowering communication precision could decrease communication overheads and improve scalability; but, on the other hand, it can also reduce the accuracy of training. In this paper, we study this trade-off space, and ask: Can low-precision communication consistently improve the end-to-end performance of training modern neural networks, with no accuracy loss? From the performance point of view, the answer to this question may appear deceptively easy: compressing communication through low precision should help when the ratio between communication and computation is high. However, this answer is less straightforward when we try to generalize this principle across various neural network architectures (e.g., AlexNet vs. ResNet), number of GPUs (e.g., 2 GPUs vs. 8 GPUs), machine configurations (e.g., EC2 instances vs. NVIDIA DGX-1), programming models (e.g., MPI vs. NCCL), and even different GPU architectures (e.g., Kepler vs. Pascal). Currently, it is not clear how a realistic realization of all these factors maps to the speedup provided by low-precision communication. In this paper, we conduct an empirical study to answer this question and report the insights we obtain.
@mastersthesis{abc,
	abstract = {Training deep learning models has received tremendous research interest recently. In
particular, there has been intensive research on reducing the communication cost of
the training process when using multiple computational devices, through reducing
the precision of the underlying data representation. Naturally, such methods induce
system trade-offs: lowering communication precision could decrease communication
overheads and improve scalability; but, on the other hand, it can also reduce the
accuracy of training. In this paper, we study this trade-off space, and ask: Can low-precision
communication consistently improve the end-to-end performance of training modern
neural networks, with no accuracy loss?
From the performance point of view, the answer to this question may appear deceptively
easy: compressing communication through low precision should help when the
ratio between communication and computation is high. However, this answer is less
straightforward when we try to generalize this principle across various neural network
architectures (e.g., AlexNet vs. ResNet), number of GPUs (e.g., 2 GPUs vs. 8 GPUs),
machine configurations (e.g., EC2 instances vs. NVIDIA DGX-1), programming models
(e.g., MPI vs. NCCL), and even different GPU architectures (e.g., Kepler vs. Pascal).
Currently, it is not clear how a realistic realization of all these factors maps to the speed
up provided by low-precision communication. In this paper, we conduct an empirical
study to answer this question and report the insights we obtain.},
	author = {Demjan Grubic},
	school = {163},
	title = {Communication-Scalable Machine Learning},
	year = {2017}
}
Proceedings of the Twelfth European Conference on Computer Systems, EuroSys 2017, Belgrade, Serbia, April 2017
@inproceedings{abc,
	author = {Zaheer Chothia and John Liagouris and Desislava Dimitrova and Timothy Roscoe},
	booktitle = {Proceedings of the Twelfth European Conference on Computer Systems, EuroSys 2017, Belgrade, Serbia},
	title = {Online Reconstruction of Structural Information from Datacenter Logs.},
	url = {http://doi.acm.org/10.1145/3064176.3064195},
	year = {2017}
}
Systems Group Master's Thesis, no. 167; Department of Computer Science, April 2017
Supervised by: Prof. Torsten Hoefler
Tapping the potential of emerging massively parallel architectures requires a thorough, systematic understanding of application performance. Graph processing, which lies at the heart of applications such as social network analysis, combinatorial algorithms, and machine learning, is especially challenging to model and reason about due to the rich structure of graphs. There is a need for algorithms, software abstractions, and models of computation that would enable robust, predictable performance on today’s petascale and future exascale systems. To address these needs, we use a precise yet practical set of models to survey the existing approaches and their limitations. Based on our insights, we propose that performance in graph processing can be largely captured by analyzing the graph representation, and suggest how to do so. We show how this paradigm is useful in designing and analyzing new algorithms, using triangle counting as a case study. Moreover, we present a case study of representations and design choices for iterative algorithms, demonstrating how our approach enables well-grounded, scientific decisions. Finally, we design parameter-oblivious, communication-avoiding algorithms for global minimum cuts and connected components that build on our insights. Experimental evidence confirms that the proposed techniques are practical: Our implementations scale up to thousands of cores and outperform comparable existing codes by up to three orders of magnitude. The representation-centric approach thus offers a useful and broadly applicable paradigm that both algorithm designers and developers can build on.
@mastersthesis{abc,
	abstract = {Tapping the potential of emerging massively parallel architectures requires
a thorough, systematic understanding of application performance.
Graph processing, which lies at the heart of applications such
as social network analysis, combinatorial algorithms, and machine
learning, is especially challenging to model and reason about due to
the rich structure of graphs. There is a need for algorithms, software
abstractions, and models of computation that would enable robust, predictable
performance on today{\textquoteright}s petascale and future exascale systems.
To address these needs, we use a precise yet practical set of models to
survey the existing approaches and their limitations. Based on our insights,
we propose that performance in graph processing can be largely
captured by analyzing the graph representation, and suggest how to
do so. We show how this paradigm is useful in designing and analyzing
new algorithms, using triangle counting as a case study. Moreover,
we present a case study of representations and design choices
for iterative algorithms, demonstrating how our approach enables well-grounded,
scientific decisions. Finally, we design parameter-oblivious,
communication-avoiding algorithms for global minimum cuts and connected
components that build on our insights. Experimental evidence
confirms that the proposed techniques are practical: Our implementations
scale up to thousands of cores and outperform comparable existing
codes by up to three orders of magnitude. The representation-centric
approach thus offers a useful and broadly applicable paradigm
that both algorithm designers and developers can build on.},
	author = {Pavel Kalvoda},
	school = {167},
	title = {Representation-Centric Graph Processing},
	year = {2017}
}
33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 2017
@inproceedings{abc,
	author = {Jie Jiang and Jiawei Jiang and Bin Cui and Ce Zhang},
	booktitle = {33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA},
	title = {TencentBoost: A Gradient Boosting Tree System with Parameter Server.},
	url = {https://doi.org/10.1109/ICDE.2017.87},
	year = {2017}
}
Systems Group Master's Thesis, no. 159; Department of Computer Science, March 2017
Supervised by: Prof. Ankit Singla
Recent work has indicated that any static data center network is fundamentally limited, due to its inability to move around network capacity. Is this truly the case, is the static network not flexible enough to handle varying (skewed) traffic scenarios through only traffic engineering? Can we only find refuge in dynamic topologies, introducing on-the-fly re-arrangement of network links at a cost? In pursuit of these research goals, three main data center topologies were evaluated: traditional (oversubscribed) fat-trees, expanders and dynamic topologies. Using flow optimality evaluation via a linear program, worst case traffic scenarios were identified and their respective performance measured under perfect traffic engineering. A custom discrete packet simulator was used to evaluate the two static topologies under a wide range of traffic scenarios and compare results to performance claims of recent dynamic topology research. It was found that the modeling of server up-links is crucial to a meaningful comparison...
@mastersthesis{abc,
	abstract = {Recent work has indicated that any static data center network is fundamentally
limited, due to its inability to move around network capacity.
Is this truly the case, is the static network not flexible enough to handle
varying (skewed) traffic scenarios through only traffic engineering?
Can we only find refuge in dynamic topologies, introducing on-the-fly
re-arrangement of network links at a cost? In pursuit of these research
goals, three main data center topologies were evaluated: traditional
(oversubscribed) fat-trees, expanders and dynamic topologies. Using
flow optimality evaluation via a linear program, worst case traffic scenarios
were identified and their respective performance measured under
perfect traffic engineering. A custom discrete packet simulator was
used to evaluate the two static topologies under a wide range of traffic
scenarios and compare results to performance claims of recent dynamic
topology research. It was found that the modeling of server up-links is
crucial to a meaningful comparison...},
	author = {Simon Kassing},
	school = {159},
	title = {Static Yet Flexible: Expander Data Center Network Fabrics Master Thesis},
	year = {2017}
}
Systems Group Master's Thesis, no. 160; Department of Computer Science, March 2017
Supervised by: Prof. Ce Zhang
We first present an application where we helped our astrophysicist collaborators to recover features from artificially degraded images with worse seeing and higher noise than the original with a performance which far exceeds simple deconvolution by training a generative adversarial network (GAN) on galaxy images. However, training time is limiting our potential to train on larger data sets. It takes 2 hours to train a GAN using 4105 galaxy images for 20 iterations on an NVIDIA TITAN X GPU. We ask the question: Can we speed up our machine learning training process by reducing the precision of data representation?...
@mastersthesis{abc,
	abstract = {We first present an application where we helped our astrophysicist collaborators
to recover features from artificially degraded images with worse
seeing and higher noise than the original with a performance which far exceeds
simple deconvolution by training a generative adversarial network
(GAN) on galaxy images. However, training time is limiting our potential
to train on larger data sets. It takes 2 hours to train a GAN using
4105 galaxy images for 20 iterations on an NVIDIA TITAN X GPU. We
ask the question: Can we speed up our machine learning training process
by reducing the precision of data representation?...},
	author = {Hantian Zhang},
	school = {160},
	title = {The ZipML Framework for Training Models with End-to-End Low Precision},
	year = {2017}
}
Proceedings of the 2017 Digital Forensics Conference, Überlingen, Germany, March 2017
Digital forensic investigators often need to extract data from a seized device that contains NAND flash memory. Many such devices are physically damaged, preventing investigators from using automated techniques to extract the data stored within the device. Instead, investigators turn to chip-off analysis, where they use a thermal-based procedure to physically remove the NAND flash memory chip from the device, and access the chip directly to extract the raw data stored on the chip. We perform an analysis of the errors introduced into multi-level cell (MLC) NAND flash memory chips after the device has been seized. We make two major observations. First, between the time that a device is seized and the time digital forensic investigators perform data extraction, a large number of errors can be introduced as a result of charge leakage from the cells of the NAND flash memory (known as data retention errors). Second, when thermal-based chip removal is performed, the number of errors in the data stored within NAND flash memory can increase by two or more orders of magnitude, as the high temperature applied to the chip greatly accelerates charge leakage. We demonstrate that the chip-off analysis based forensic data recovery procedure is quite destructive, and can often render most of the data within NAND flash memory uncorrectable, and, thus, unrecoverable. To mitigate the errors introduced during the forensic recovery process, we explore a new hardware-based approach. We exploit a fine-grained read reference voltage control mechanism implemented in modern NAND flash memory chips, called read-retry, which can compensate for the charge leakage that occurs due to (1) retention loss and (2) thermal-based chip removal. The read-retry mechanism successfully reduces the number of errors, such that the original data can be fully recovered in our tested chips as long as the chips were not heavily used prior to seizure.
We conclude that the read-retry mechanism should be adopted as part of the forensic data recovery process.
@inproceedings{abc,
	abstract = {Digital forensic investigators often need to extract data from a seized device that contains NAND flash memory. Many such devices are physically damaged, preventing investigators from using automated techniques to extract the data stored within the device. Instead, investigators turn to chip-off analysis, where they use a thermal-based procedure to physically remove the NAND flash memory chip from the device, and access the chip directly to extract the raw data stored on the chip.

We perform an analysis of the errors introduced into multi-level cell (MLC) NAND flash memory chips after the device has been seized. We make two major observations. First, between the time that a device is seized and the time digital forensic investigators perform data extraction, a large number of errors can be introduced as a result of charge leakage from the cells of the NAND flash memory (known as data retention errors). Second, when thermal-based chip removal is performed, the number of errors in the data stored within NAND flash memory can increase by two or more orders of magnitude, as the high temperature applied to the chip greatly accelerates charge leakage. We demonstrate that the chip-off analysis based forensic data recovery procedure is quite destructive, and can often render most of the data within NAND flash memory uncorrectable, and, thus, unrecoverable.

To mitigate the errors introduced during the forensic recovery process, we explore a new hardware-based approach. We exploit a fine-grained read reference voltage control mechanism implemented in modern NAND flash memory chips, called read-retry, which can compensate for the charge leakage that occurs due to (1) retention loss and (2) thermal-based chip removal. The read-retry mechanism successfully reduces the number of errors, such that the original data can be fully recovered in our tested chips as long as the chips were not heavily used prior to seizure. We conclude that the read-retry mechanism should be adopted as part of the forensic data recovery process.},
	author = {Aya Fukami and Saugata Ghose and Yixin Luo and Yu Cai and Onur Mutlu},
	booktitle = {Proceedings of the 2017 Digital Forensics Conference},
	title = {Improving the Reliability of Chip-Off Forensic Analysis of NAND Flash Memory Devices},
	venue = {{\"U}berlingen, Germany},
	year = {2017}
}
Passive and Active Measurement - 18th International Conference, PAM 2017, Sydney, NSW, Australia, March 2017
@inproceedings{abc,
	author = {Ilker Nadi Bozkurt and Anthony Aguirre and Balakrishnan Chandrasekaran and Brighten Godfrey and Gregory Laughlin and Bruce M. Maggs and Ankit Singla},
	booktitle = {Passive and Active Measurement - 18th International Conference, PAM 2017, Sydney, NSW, Australia},
	title = {Why Is the Internet so Slow?!},
	url = {http://dx.doi.org/10.1007/978-3-319-54328-4_13},
	year = {2017}
}
Systems Group Master's Thesis, no. 161; Department of Computer Science, March 2017
Supervised by: Prof. Timothy Roscoe
Current mainstream processors provide multiple SMT (i.e., simultaneous multithreading) lanes on top of each physical core. These hardware threads share more resources (e.g., execution units and caches) when compared to CPU cores, but are managed by operating systems in the same way as if they were separate physical cores. This thesis explores the interaction between hardware threads and proposes an extension to the Barrelfish OS, meant to improve the performance of a system by adequately handling SMT lanes. On an Intel Haswell CPU, with 2-way SMT via Hyper-Threading Technology, each SMT lane had 2/3 of the processing power that was yielded by the physical core with a single active hardware thread. The multi-HT CPU Driver (i.e., Barrelfish's microkernel) is able to modify the set of active hardware threads with an overhead on the order of thousands of processor cycles, which means that it can quickly adapt to the parallelism exhibited by the workload.
@mastersthesis{abc,
	abstract = {Current mainstream processors provide multiple SMT (i.e., simultaneous multithreading) lanes on top of each physical core. These hardware threads share more resources (e.g., execution units and caches) when compared to CPU cores, but are managed by operating systems in the same way as if they were separate physical cores. This thesis explores the interaction between hardware threads and proposes an extension to the Barrelfish OS, meant to improve the performance of a system by adequately handling SMT lanes. On an Intel Haswell CPU, with 2-way SMT via Hyper-Threading Technology, each SMT lane had 2/3 of the processing power that was yielded by the physical core with a single active hardware thread. The multi-HT CPU Driver (i.e., Barrelfish{\textquoteright}s microkernel) is able to modify the set of active hardware threads with an overhead on the order of thousands of processor cycles, which means that it can quickly adapt to the parallelism exhibited by the workload.},
	author = {Andrei Poenaru},
	school = {161},
	title = {Explicit OS support for hardware threads},
	year = {2017}
}
14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017, Boston, MA, USA, March 2017
@inproceedings{abc,
	author = {Kevin Hsieh and Aaron Harlap and Nandita Vijaykumar and Dimitris Konomis and Gregory R. Ganger and Phillip B. Gibbons and Onur Mutlu},
	booktitle = {14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017, Boston, MA, USA},
	title = {Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds.},
	url = {https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/hsieh},
	year = {2017}
}
Design, Automation & Test in Europe Conference & Exhibition, DATE 2017, Lausanne, Switzerland, March 2017
@inproceedings{abc,
	author = {Onur Mutlu},
	booktitle = {Design, Automation \& Test in Europe Conference \& Exhibition, DATE 2017, Lausanne, Switzerland},
	title = {The RowHammer problem and other issues we may face as memory becomes denser.},
	url = {https://doi.org/10.23919/DATE.2017.7927156},
	year = {2017}
}
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, California, USA, February 2017
@inproceedings{abc,
	author = {Besmira Nushi and Ece Kamar and Eric Horvitz and Donald Kossmann},
	booktitle = {Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence},
	title = {On Human Intellect and Machine Failures: Troubleshooting Integrative Machine Learning Systems.},
	url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/15032},
	venue = {San Francisco, California, USA},
	year = {2017}
}
2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA, February 2017
@inproceedings{abc,
	author = {Hasan Hassan and Nandita Vijaykumar and Samira Manabi Khan and Saugata Ghose and Kevin K. Chang and Gennady Pekhimenko and Donghyuk Lee and Oguz Ergin and Onur Mutlu},
	booktitle = {2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA},
	title = {SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies.},
	url = {https://doi.org/10.1109/HPCA.2017.62},
	year = {2017}
}
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Austin, TX, USA, February 2017
Task-based programming offers an elegant way to express units of computation and the dependencies among them, making it easier to distribute the computational load evenly across multiple cores. However, this separation of problem decomposition and parallelism requires a sufficiently large input problem to achieve satisfactory efficiency on a given number of cores. Unfortunately, finding a good match between input size and core count usually requires significant experimentation, which is expensive and sometimes even impractical. In this paper, we propose an automated empirical method for finding the isoefficiency function of a task-based program, binding efficiency, core count, and the input size in one analytical expression. This allows the latter two to be adjusted according to given (realistic) efficiency objectives. Moreover, we not only find (i) the actual isoefficiency function but also (ii) the function one would yield if the program execution was free of resource contention and (iii) an upper bound that could only be reached if the program was able to maintain its average parallelism throughout its execution. The difference between the three helps to explain low efficiency, and in particular, it helps to differentiate between resource contention and structural conflicts related to task dependencies or scheduling. The insights gained can be used to co-design programs and shared system resources.
@inproceedings{abc,
	abstract = {Task-based programming offers an elegant way to express units of computation and the dependencies among them, making it easier to distribute the computational load evenly across multiple cores. However, this separation of problem decomposition and parallelism requires a sufficiently large input problem to achieve satisfactory efficiency on a given number of cores. Unfortunately, finding a good match between input size and core count usually requires significant experimentation, which is expensive and sometimes even impractical. In this paper, we propose an automated empirical method for finding the isoefficiency function of a task-based program, binding efficiency, core count, and the input size in one analytical expression. This allows the latter two to be adjusted according to given (realistic) efficiency objectives. Moreover, we not only find (i) the actual isoefficiency function but also (ii) the function one would yield if the program execution was free of resource contention and (iii) an upper bound that could only be reached if the program was able to maintain its average parallelism throughout its execution. The difference between the three helps to explain low efficiency, and in particular, it helps to differentiate between resource contention and structural conflicts related to task dependencies or scheduling. The insights gained can be used to co-design programs and shared system resources.},
	author = {Sergei Shudler and Alexandru Calotoiu and Torsten Hoefler and Felix Wolf},
	booktitle = {Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
	title = {Isoefficiency in Practice: Configuring and Understanding the Performance of Task-based Applications},
	venue = {Austin, TX, USA},
	year = {2017}
}
2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA, February 2017
@inproceedings{abc,
	author = {Yu Cai and Saugata Ghose and Yixin Luo and Ken Mai and Onur Mutlu and Erich F. Haratsch},
	booktitle = {2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA},
	title = {Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques.},
	url = {https://doi.org/10.1109/HPCA.2017.61},
	year = {2017}
}
CoRR, January 2017
@inproceedings{abc,
	author = {Youyou Lu and Jiwu Shu and Long Sun and Onur Mutlu},
	booktitle = {CoRR},
	title = {Improving the Performance and Endurance of Persistent Memory with Loose-Ordering Consistency.},
	url = {http://arxiv.org/abs/1705.03623},
	year = {2017}
}
Datenbanksysteme für Business, Technologie und Web (BTW 2017), 17. Fachtagung des GI-Fachbereichs Datenbanken und Informationssysteme (DBIS), 6.-10. März 2017, Stuttgart, Germany, Proceedings, January 2017
@inproceedings{abc,
	author = {Donald Kossmann},
	booktitle = {Datenbanksysteme f{\"u}r Business, Technologie und Web (BTW 2017), 17. Fachtagung des GI-Fachbereichs Datenbanken und Informationssysteme (DBIS), 6.-10. M{\"a}rz 2017, Stuttgart, Germany, Proceedings},
	title = {Confidentiality {\`a} la Carte with Cipherbase.},
	year = {2017}
}
CoRR, January 2017
@article{abc,
	author = {Yu Cai and Saugata Ghose and Erich F. Haratsch and Yixin Luo and Onur Mutlu},
	journal = {CoRR},
	title = {Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid-State Drives.},
	url = {http://arxiv.org/abs/1706.08642},
	year = {2017}
}
CoRR, January 2017
@inproceedings{abc,
	author = {Heng Guo and Kaan Kara and Ce Zhang},
	booktitle = {CoRR},
	title = {Layerwise Systematic Scan: Deep Boltzmann Machines and Beyond.},
	url = {http://arxiv.org/abs/1705.05154},
	year = {2017}
}
CoRR, January 2017
@article{abc,
	author = {Yixin Luo and Saugata Ghose and Tianshi Li and Sriram Govindan and Bikash Sharma and Bryan Kelly and Amirali Boroumand and Onur Mutlu},
	journal = {CoRR},
	title = {Using ECC DRAM to Adaptively Increase Memory Capacity.},
	url = {http://arxiv.org/abs/1706.08870},
	year = {2017}
}
CoRR, January 2017
@inproceedings{abc,
	author = {Kevin K. Chang and Abdullah Giray Yaglik{\c c}i and Saugata Ghose and Aditya Agrawal and Niladrish Chatterjee and Abhijith Kashyap and Donghyuk Lee and Mike O{\textquoteright}Connor and Hasan Hassan and Onur Mutlu},
	booktitle = {CoRR},
	title = {Understanding Reduced-Voltage Operation in Modern DRAM Chips: Characterization, Analysis, and Mechanisms.},
	url = {http://arxiv.org/abs/1705.10292},
	year = {2017}
}
CoRR, January 2017
@article{abc,
	author = {Amirali Boroumand and Saugata Ghose and Minesh Patel and Hasan Hassan and Brandon Lucia and Nastaran Hajinazar and Kevin Hsieh and Krishna T. Malladi and Hongzhong Zheng and Onur Mutlu},
	journal = {CoRR},
	title = {LazyPIM: Efficient Support for Cache Coherence in Processing-in-Memory Architectures.},
	url = {http://arxiv.org/abs/1706.03162},
	year = {2017}
}
CoRR, January 2017
@inproceedings{abc,
	author = {Xiangru Lian and Ce Zhang and Huan Zhang and Cho-Jui Hsieh and Wei Zhang and Ji Liu},
	booktitle = {CoRR},
	title = {Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent.},
	url = {http://arxiv.org/abs/1705.09056},
	year = {2017}
}
Commun. ACM, January 2017
The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources including emails, webpages, and pdf reports. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are key tools to attack classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.
@inproceedings{abc,
	abstract = {The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources including emails, webpages, and pdf reports. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are key tools to attack classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.},
	author = {Ce Zhang and Christopher R{\'e} and Michael J. Cafarella and Jaeho Shin and Feiran Wang and Sen Wu},
	booktitle = {Commun. ACM},
	title = {DeepDive: declarative knowledge base construction.},
	url = {http://doi.acm.org/10.1145/3060586},
	year = {2017}
}
CoRR, January 2017
@article{abc,
	author = {Onur Mutlu},
	journal = {CoRR},
	title = {The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser.},
	url = {http://arxiv.org/abs/1703.00626},
	year = {2017}
}
Advances in Computers, January 2017
@article{abc,
	author = {Vivek Seshadri and Onur Mutlu},
	journal = {Advances in Computers},
	title = {Chapter Four - Simple Operations in Memory to Reduce Data Movement.},
	url = {https://doi.org/10.1016/bs.adcom.2017.04.004},
	year = {2017}
}
CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 2017
@inproceedings{abc,
	author = {Badrish Chandramouli and Johannes Gehrke and Jonathan Goldstein and Donald Kossmann and Justin J. Levandoski and Renato Marroquin and Wenlei Xie},
	booktitle = {CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA},
	title = {READY: Completeness is in the Eye of the Beholder.},
	url = {http://cidrdb.org/cidr2017/papers/p18-chandramouli-cidr17.pdf},
	year = {2017}
}
POMACS, January 2017
@article{abc,
	author = {Kevin K. Chang and A. Giray Ya{\u g}l{\i}k{\c c}{\i} and Saugata Ghose and Aditya Agrawal and Niladrish Chatterjee and Abhijith Kashyap and Donghyuk Lee and Mike O{\textquoteright}Connor and Hasan Hassan and Onur Mutlu},
	journal = {POMACS},
	title = {Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms.},
	url = {http://doi.acm.org/10.1145/3084447},
	year = {2017}
}
CoRR, January 2017
@article{abc,
	author = {Lucas Braun and Renato Marroquin and Kai-En Tsay and Donald Kossmann},
	journal = {CoRR},
	title = {MTBase: Optimizing Cross-Tenant Database Queries.},
	url = {http://arxiv.org/abs/1703.04290},
	year = {2017}
}
PVLDB, January 2017
@article{abc,
	author = {Claude Barthels and Gustavo Alonso and Torsten Hoefler and Timo Schneider and Ingo M{\"u}ller},
	journal = {PVLDB},
	title = {Distributed Join Algorithms on Thousands of Cores.},
	url = {http://www.vldb.org/pvldb/vol10/p517-barthels.pdf},
	year = {2017}
}
POMACS, January 2017
@article{abc,
	author = {Donghyuk Lee and Samira Manabi Khan and Lavanya Subramanian and Saugata Ghose and Rachata Ausavarungnirun and Gennady Pekhimenko and Vivek Seshadri and Onur Mutlu},
	journal = {POMACS},
	title = {Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms.},
	url = {http://doi.acm.org/10.1145/3084464},
	year = {2017}
}
PVLDB, January 2017
@article{abc,
	author = {Zhipeng Zhang and Yingxia Shao and Bin Cui and Ce Zhang},
	journal = {PVLDB},
	title = {An Experimental Evaluation of SimRank-based Similarity Search Algorithms.},
	url = {http://www.vldb.org/pvldb/vol10/p601-zhang.pdf},
	year = {2017}
}
CoRR, January 2017
@article{abc,
	author = {Xiangyao Yu and Christopher J. Hughes and Nadathur Satish and Onur Mutlu and Srinivas Devadas},
	journal = {CoRR},
	title = {Banshee: Bandwidth-Efficient DRAM Caching Via Software/Hardware Cooperation.},
	url = {http://arxiv.org/abs/1704.02677},
	year = {2017}
}
VLDB J., January 2017
Populating a database with unstructured information is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. Recent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based respectively on sampling and variational techniques. We also study the tradeoff space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.
@article{abc,
	abstract = {Populating a database with unstructured information is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. Recent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based respectively on sampling and variational techniques. We also study the tradeoff space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.},
	author = {Christopher De Sa and Alexander Ratner and Christopher R{\'e} and Jaeho Shin and Feiran Wang and Sen Wu and Ce Zhang},
	journal = {VLDB J.},
	title = {Incremental knowledge base construction using DeepDive.},
	url = {http://dx.doi.org/10.1007/s00778-016-0437-2},
	year = {2017}
}
IEEE Data Eng. Bull., January 2017
High-throughput, low-latency networks are becoming a key element in database appliances and data processing systems to reduce the overhead of data movement. In this article, we focus on Remote Direct Memory Access (RDMA), a feature increasingly available in modern networks enabling the network card to directly write to and read from main memory. RDMA has started to attract attention as a technical solution to quite a few performance bottlenecks in distributed data management but there is still much work to be done to make it an effective technology suitable for database engines. In this article, we identify several advantages and drawbacks of RDMA and related technologies, and propose new communication primitives that would bridge the gap between the operations provided by high-speed networks and the needs of data processing systems.
@article{abc,
	abstract = {High-throughput, low-latency networks are becoming a key element in database appliances and data processing systems to reduce the overhead of data movement. In this article, we focus on Remote Direct Memory Access (RDMA), a feature increasingly available in modern networks enabling the network card to directly write to and read from main memory. RDMA has started to attract attention as a technical solution to quite a few performance bottlenecks in distributed data management but there is still much work to be done to make it an effective technology suitable for database engines. In this article, we identify several advantages and drawbacks of RDMA and related technologies, and propose new communication primitives that would bridge the gap between the operations provided by high-speed networks and the needs of data processing systems.},
	author = {Claude Barthels and Gustavo Alonso and Torsten Hoefler},
	journal = {IEEE Data Eng. Bull.},
	title = {Designing Databases for Future High-Performance Networks.},
	url = {http://sites.computer.org/debull/A17mar/p15.pdf},
	year = {2017}
}
CoRR, January 2017
@article{abc,
	author = {Kevin Schawinski and Ce Zhang and Hantian Zhang and Lucas Fowler and Gokula Krishnan Santhanam},
	journal = {CoRR},
	title = {Generative Adversarial Networks recover features in astrophysical images of galaxies beyond the deconvolution limit.},
	url = {http://arxiv.org/abs/1702.00403},
	year = {2017}
}
Computer Architecture Letters, January 2017
@article{abc,
	author = {Amirali Boroumand and Saugata Ghose and Minesh Patel and Hasan Hassan and Brandon Lucia and Kevin Hsieh and Krishna T. Malladi and Hongzhong Zheng and Onur Mutlu},
	journal = {Computer Architecture Letters},
	title = {LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory.},
	url = {https://doi.org/10.1109/LCA.2016.2577557},
	year = {2017}
}

2016

2016 IEEE International Conference on Big Data, BigData 2016, Washington DC, USA, December 2016
@inproceedings{abc,
	author = {Heqing Huang and Cong Zheng and Junyuan Zeng and Wu Zhou and Sencun Zhu and Peng Liu and Suresh Chari and Ce Zhang},
	booktitle = {2016 IEEE International Conference on Big Data, BigData 2016, Washington DC, USA},
	title = {Android malware development on public malware scanning platforms: A large-scale data-driven study.},
	url = {http://dx.doi.org/10.1109/BigData.2016.7840712},
	year = {2016}
}
Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER), Taipei, Taiwan, December 2016
Tuning large applications requires a clever exploration of the design and configuration space. Especially on supercomputers, this space is so large that its exhaustive traversal via performance experiments becomes too expensive, if not impossible. Manually creating analytical performance models provides insights into optimization opportunities but is extremely laborious if done for applications of realistic size. If we must consider multiple performance-relevant parameters and their possible interactions, a common requirement, this task becomes even more complex. We build on previous work on automatic scalability modeling and significantly extend it to allow insightful modeling of any combination of application execution parameters. Multi-parameter modeling has so far been outside the reach of automatic methods due to the exponential growth of the model search space. We develop a new technique to traverse the search space rapidly and generate insightful performance models that enable a wide range of uses from performance predictions for balanced machine design to performance tuning.
@inproceedings{abc,
	abstract = {Tuning large applications requires a clever exploration of the design and configuration space. Especially on supercomputers, this space is so large that its exhaustive traversal via performance experiments becomes too expensive, if not impossible. Manually creating analytical performance models provides insights into optimization opportunities but is extremely laborious if done for applications of realistic size. If we must consider multiple performance-relevant parameters and their possible interactions, a common requirement, this task becomes even more complex. We build on previous work on automatic scalability modeling and significantly extend it to allow insightful modeling of any combination of application execution parameters. Multi-parameter modeling has so far been outside the reach of automatic methods due to the exponential growth of the model search space. We develop a new technique to traverse the search space rapidly and generate insightful performance models that enable a wide range of uses from performance predictions for balanced machine design to performance tuning.},
	author = {Alexandru Calotoiu and David Beckinsale and Christopher W. Earl and Torsten Hoefler and Ian Karlin and Martin Schulz and Felix Wolf},
	booktitle = {Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER)},
	title = {Fast Multi-Parameter Performance Modeling},
	venue = {Taipei, Taiwan},
	year = {2016}
}
Systems Group Master's Thesis, no. 156; Department of Computer Science, December 2016
Supervised by: Prof. Gustavo Alonso
@mastersthesis{abc,
	author = {Jingyi Wang},
	school = {Department of Computer Science},
	note = {Systems Group Master's Thesis, no. 156},
	title = {Performance Analysis of Decision Tree Learning Algorithms on Multicore CPUs},
	year = {2016}
}
Proceedings of the IEEE 24th Annual Symposium on High-Performance Interconnects (HOTI), Santa Clara, CA, USA, December 2016
Lossless networks, such as InfiniBand, use flow control to avoid packet loss due to congestion. This introduces dependencies between input and output channels; in case of cyclic dependencies, the network can deadlock. Deadlocks can be resolved by splitting a physical channel into multiple virtual channels with independent buffers and credit systems. Currently available routing engines for InfiniBand assign entire paths from source to destination nodes to different virtual channels. However, InfiniBand allows changing the virtual channel at every switch. We developed fast routing engines which make use of that fact and map individual hops to virtual channels. Our algorithm imposes a total order on virtual channels and increments the virtual channel at every hop; thus, the diameter of the network is an upper bound for the required number of virtual channels. We integrated this algorithm into the InfiniBand software stack. Our algorithms provide deadlock-free routing on state-of-the-art low-diameter topologies, using fewer virtual channels than currently available practical approaches, while being faster by a factor of four on large networks. Since low-diameter topologies are common among the largest supercomputers in the world, providing deadlock-free routing for such systems is very important.
@inproceedings{abc,
	abstract = {Lossless networks, such as InfiniBand, use flow control to avoid packet loss due to congestion. This introduces dependencies between input and output channels; in case of cyclic dependencies, the network can deadlock. Deadlocks can be resolved by splitting a physical channel into multiple virtual channels with independent buffers and credit systems. Currently available routing engines for InfiniBand assign entire paths from source to destination nodes to different virtual channels. However, InfiniBand allows changing the virtual channel at every switch. We developed fast routing engines which make use of that fact and map individual hops to virtual channels. Our algorithm imposes a total order on virtual channels and increments the virtual channel at every hop; thus, the diameter of the network is an upper bound for the required number of virtual channels. We integrated this algorithm into the InfiniBand software stack. Our algorithms provide deadlock-free routing on state-of-the-art low-diameter topologies, using fewer virtual channels than currently available practical approaches, while being faster by a factor of four on large networks. Since low-diameter topologies are common among the largest supercomputers in the world, providing deadlock-free routing for such systems is very important.},
	author = {Timo Schneider and Otto Bibartiu and Torsten Hoefler},
	booktitle = {Proceedings of the IEEE 24th Annual Symposium on High-Performance Interconnects (HOTI)},
	title = {Ensuring Deadlock-Freedom in Low-Diameter InfiniBand Networks},
	venue = {Santa Clara, CA, USA},
	year = {2016}
}
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT, USA, November 2016
The goal of the extreme scale plasma turbulence studies described in this paper is to expedite the delivery of reliable predictions on confinement physics in large magnetic fusion systems by using world-class supercomputers to carry out simulations with unprecedented resolution and temporal duration. This has involved architecture-dependent optimizations of performance scaling and addressing code portability and energy issues, with the metrics for multi-platform comparisons being "time-to-solution" and "energy-to-solution". Realistic results addressing how confinement losses caused by plasma turbulence scale from present-day devices to the much larger $25 billion international ITER fusion facility have been enabled by innovative advances in the GTC-P code including (i) implementation of one-sided communication from MPI 3.0 standard; (ii) creative optimization techniques on Xeon Phi processors; and (iii) development of a novel performance model for the key kernels of the PIC code. Results show that modeling data movement is sufficient to predict performance on modern supercomputer platforms.
@inproceedings{abc,
	abstract = {The goal of the extreme scale plasma turbulence studies described in this paper is to expedite the delivery of reliable predictions on confinement physics in large magnetic fusion systems by using world-class supercomputers to carry out simulations with unprecedented resolution and temporal duration. This has involved architecture-dependent optimizations of performance scaling and addressing code portability and energy issues, with the metrics for multi-platform comparisons being "time-to-solution" and "energy-to-solution". Realistic results addressing how confinement losses caused by plasma turbulence scale from present-day devices to the much larger \$25 billion international ITER fusion facility have been enabled by innovative advances in the GTC-P code including (i) implementation of one-sided communication from MPI 3.0 standard; (ii) creative optimization techniques on Xeon Phi processors; and (iii) development of a novel performance model for the key kernels of the PIC code. Results show that modeling data movement is sufficient to predict performance on modern supercomputer platforms.},
	author = {William Tang and Stephane Ethier and Grzegorz Kwasniewski and Torsten Hoefler and Khaled Z. Ibrahim and Kamesh Madduri and Samuel Williams and Leonid Oliker and Carlos Rosales-Fernandez and Tim Williams},
	booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
	title = {Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide},
	venue = {Salt Lake City, UT, USA},
	year = {2016}
}
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT, USA, November 2016
The interconnection network has a large influence on total cost, application performance, energy consumption, and overall system efficiency of a supercomputer. Unfortunately, today's routing algorithms do not utilize this important resource most efficiently. We first demonstrate this by defining the dark fiber metric as a measure of unused resources in networks. To improve the utilization, we propose scheduling-aware routing, a new technique that uses the current state of the batch system to determine a new set of network routes and so increases overall system utilization by up to 17.74%. We also show that our proposed routing increases the throughput of communication benchmarks by up to 17.6% on a practical InfiniBand installation. Our routing method is implemented in the standard InfiniBand tool set and can immediately be used to optimize systems. In fact, we have been using it to improve the utilization of our production petascale supercomputer for more than one year.
@inproceedings{abc,
	abstract = {The interconnection network has a large influence on total cost, application performance, energy consumption, and overall system efficiency of a supercomputer. Unfortunately, today{\textquoteright}s routing algorithms do not utilize this important resource most efficiently. We first demonstrate this by defining the dark fiber metric as a measure of unused resources in networks. To improve the utilization, we propose scheduling-aware routing, a new technique that uses the current state of the batch system to determine a new set of network routes and so increases overall system utilization by up to 17.74\%. We also show that our proposed routing increases the throughput of communication benchmarks by up to 17.6\% on a practical InfiniBand installation. Our routing method is implemented in the standard InfiniBand tool set and can immediately be used to optimize systems. In fact, we have been using it to improve the utilization of our production petascale supercomputer for more than one year.},
	author = {Jens Domke and Torsten Hoefler},
	booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
	title = {Scheduling-Aware Routing for Supercomputers},
	venue = {Salt Lake City, UT, USA},
	year = {2016}
}
4th Workshop on Interactions of NVM/Flash with Operating Systems and Workloads, INFLOW@OSDI 2016, Savannah, GA, USA, November 2016
@inproceedings{abc,
	author = {Himanshu Chauhan and Irina Calciu and Vijay Chidambaram and Eric Schkufza and Onur Mutlu and Pratap Subrahmanyam},
	booktitle = {4th Workshop on Interactions of NVM/Flash with Operating Systems and Workloads, INFLOW@OSDI 2016, Savannah, GA, USA},
	title = {NVMOVE: Helping Programmers Move to Byte-Based Persistence.},
	url = {https://www.usenix.org/conference/inflow16/workshop-program/presentation/chauhan},
	year = {2016}
}
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT, USA, November 2016
Over the last decade, CUDA and the underlying GPU hardware architecture have continuously gained popularity in various high-performance computing application domains such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent programming model for GPU clusters. We therefore introduce the dCUDA programming model, which implements device-side remote memory access with target notification. To hide instruction pipeline latencies, CUDA programs over-decompose the problem and over-subscribe the device by running many more threads than there are hardware execution units. Whenever a thread stalls, the hardware scheduler immediately proceeds with the execution of another thread ready for execution. This latency hiding technique is key to make best use of the available hardware resources. With dCUDA, we apply latency hiding at cluster scale to automatically overlap computation and communication. Our benchmarks demonstrate perfect overlap for memory bandwidth-bound tasks and good overlap for compute-bound tasks.
@inproceedings{abc,
	abstract = {Over the last decade, CUDA and the underlying GPU hardware architecture have continuously gained popularity in various high-performance computing application domains such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent programming model for GPU clusters. We therefore introduce the dCUDA programming model, which implements device-side remote memory access with target notification. To hide instruction pipeline latencies, CUDA programs over-decompose the problem and over-subscribe the device by running many more threads than there are hardware execution units. Whenever a thread stalls, the hardware scheduler immediately proceeds with the execution of another thread ready for execution. This latency hiding technique is key to make best use of the available hardware resources. With dCUDA, we apply latency hiding at cluster scale to automatically overlap computation and communication. Our benchmarks demonstrate perfect overlap for memory bandwidth-bound tasks and good overlap for compute-bound tasks.},
	author = {Tobias Gysi and Jeremia B{\"a}r and Torsten Hoefler},
	booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
	title = {dCUDA: Hardware Supported Overlap of Computation and Communication},
	venue = {Salt Lake City, UT, USA},
	year = {2016}
}
12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2016
@inproceedings{abc,
	author = {Khanh Nguyen and Lu Fang and Guoqing (Harry) Xu and Brian Demsky and Shan Lu and Sanazsadat Alamian and Onur Mutlu},
	booktitle = {12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA},
	title = {Yak: A High-Performance Big-Data-Friendly Garbage Collector.},
	url = {https://www.usenix.org/conference/osdi16/technical-sessions/presentation/nguyen},
	year = {2016}
}
Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, Amsterdam, Netherlands, November 2016
Recent advances in networking hardware have led to a new generation of Remote Memory Access (RMA) networks in which processors from different machines can communicate directly, bypassing the operating system and allowing higher performance. Researchers and practitioners have proposed libraries and programming models for RMA to enable the development of applications running on these networks. However, the memory models implied by these RMA libraries and languages are often loosely specified, poorly understood, and differ depending on the underlying network architecture and other factors. Hence, it is difficult to precisely reason about the semantics of RMA programs or how changes in the network architecture affect them. We address this problem with the following contributions: (i) a coreRMA language which serves as a common foundation, formalizing the essential characteristics of RMA programming; (ii) complete axiomatic semantics for that language; (iii) integration of our semantics with an existing constraint solver, enabling us to exhaustively generate coreRMA programs (litmus tests) up to a specified bound and check whether the tests satisfy their specification; and (iv) extensive validation of our semantics on real-world RMA systems. We generated and ran 7441 litmus tests using each of the low-level RMA network APIs: DMAPP, VPI Verbs, and Portals 4. Our results confirmed that our model successfully captures behaviors exhibited by these networks. Moreover, we found RMA programs that behave inconsistently with existing documentation, confirmed by network experts. Our work provides an important step towards understanding existing RMA networks, thus influencing the design of future RMA interfaces and hardware.
@inproceedings{abc,
	abstract = {Recent advances in networking hardware have led to a new generation of Remote Memory Access (RMA) networks in which processors from different machines can communicate directly, bypassing the operating system and allowing higher performance. Researchers and practitioners have proposed libraries and programming models for RMA to enable the development of applications running on these networks.

However, the memory models implied by these RMA libraries and languages are often loosely specified, poorly understood, and differ depending on the underlying network architecture and other factors. Hence, it is difficult to precisely reason about the semantics of RMA programs or how changes in the network architecture affect them.

We address this problem with the following contributions: (i) a coreRMA language which serves as a common foundation, formalizing the essential characteristics of RMA programming; (ii) complete axiomatic semantics for that language; (iii) integration of our semantics with an existing constraint solver, enabling us to exhaustively generate coreRMA programs (litmus tests) up to a specified bound and check whether the tests satisfy their specification; and (iv) extensive validation of our semantics on real-world RMA systems. We generated and ran 7441 litmus tests using each of the low-level RMA network APIs: DMAPP, VPI Verbs, and Portals 4. Our results confirmed that our model successfully captures behaviors exhibited by these networks. Moreover, we found RMA programs that behave inconsistently with existing documentation, confirmed by network experts.

Our work provides an important step towards understanding existing RMA networks, thus influencing the design of future RMA interfaces and hardware.},
	author = {Andrei Marian Dan and Patrick Lam and Torsten Hoefler and Martin Vechev},
	booktitle = {Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications},
	title = {Modeling and Analysis of Remote Memory Access Programming},
	venue = {Amsterdam, Netherlands},
	year = {2016}
}
12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2016
@inproceedings{abc,
	author = {Stefan Kaestle and Reto Achermann and Roni Haecki and Moritz Hoffmann and Sabela Ramos and Timothy Roscoe},
	booktitle = {12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA},
	title = {Machine-Aware Atomic Broadcast Trees for Multicores.},
	url = {https://www.usenix.org/conference/osdi16/technical-sessions/presentation/kaestle},
	year = {2016}
}
Systems Group Master's Thesis, no. 155; Department of Computer Science, November 2016
Supervised by: Prof. Timothy Roscoe
@mastersthesis{abc,
	author = {Sebastian Wicki},
	school = {Department of Computer Science},
	note = {Systems Group Master's Thesis, no. 155},
	title = {An Online Stream Processor for Timely Dataflow},
	year = {2016}
}
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016, Salt Lake City, UT, USA, November 2016
@inproceedings{abc,
	author = {Sangeetha Abdu Jyothi and Ankit Singla and Brighten Godfrey and Alexandra Kolla},
	booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016, Salt Lake City, UT, USA},
	title = {Measuring and understanding throughput of network topologies.},
	url = {https://doi.org/10.1109/SC.2016.64},
	year = {2016}
}
Proceedings of the 15th ACM Workshop on Hot Topics in Networks, HotNets 2016, Atlanta, GA, USA, November 2016
@inproceedings{abc,
	author = {Ankit Singla},
	booktitle = {Proceedings of the 15th ACM Workshop on Hot Topics in Networks, HotNets 2016, Atlanta, GA, USA},
	title = {Fat-FREE Topologies.},
	url = {http://doi.acm.org/10.1145/3005745.3005747},
	year = {2016}
}
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT, USA, November 2016
MeteoSwiss, the Swiss national weather forecast institute, has selected densely populated accelerator servers as their primary system to compute weather forecast simulation. Servers with multiple accelerator devices that are primarily connected by a PCI-Express (PCIe) network achieve a significantly higher energy efficiency. Memory transfers between accelerators in such a system are subjected to PCIe arbitration policies. In this paper, we study the impact of PCIe topology and develop a congestion-aware performance model for PCIe communication. We present an algorithm for computing congestion factors of every communication in a congestion graph that characterizes the dynamic usage of network resources by an application. Our model applies to any PCIe tree topology. Our validation results on two different topologies of 8 GPU devices demonstrate that our model achieves an accuracy of over 97% within the PCIe network. We demonstrate the model on a weather forecast application to identify the best algorithms for its communication patterns among GPUs.
@inproceedings{abc,
	abstract = {MeteoSwiss, the Swiss national weather forecast institute, has selected densely populated accelerator servers as their primary system to compute weather forecast simulation. Servers with multiple accelerator devices that are primarily connected by a PCI-Express (PCIe) network achieve a significantly higher energy efficiency. Memory transfers between accelerators in such a system are subjected to PCIe arbitration policies. In this paper, we study the impact of PCIe topology and develop a congestion-aware performance model for PCIe communication. We present an algorithm for computing congestion factors of every communication in a congestion graph that characterizes the dynamic usage of network resources by an application. Our model applies to any PCIe tree topology. Our validation results on two different topologies of 8 GPU devices demonstrate that our model achieves an accuracy of over 97\% within the PCIe network. We demonstrate the model on a weather forecast application to identify the best algorithms for its communication patterns among GPUs.},
	author = {Maxime Martinasso and Grzegorz Kwasniewski and Sadaf R. Alam},
	booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
	title = {A PCIe congestion-aware performance model for densely populated accelerator servers},
	venue = {Salt Lake City, UT, USA},
	year = {2016}
}
49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan, October 2016
@inproceedings{abc,
	author = {Nandita Vijaykumar and Kevin Hsieh and Gennady Pekhimenko and Samira Manabi Khan and Ashish Shrestha and Saugata Ghose and Adwait Jog and Phillip B. Gibbons and Onur Mutlu},
	booktitle = {49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan},
	title = {Zorua: A holistic approach to resource virtualization in GPUs.},
	url = {http://dx.doi.org/10.1109/MICRO.2016.7783718},
	year = {2016}
}
49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan, October 2016
@inproceedings{abc,
	author = {Milad Hashemi and Onur Mutlu and Yale N. Patt},
	booktitle = {49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan},
	title = {Continuous runahead: Transparent hardware acceleration for memory intensive workloads.},
	url = {http://dx.doi.org/10.1109/MICRO.2016.7783764},
	year = {2016}
}
34th IEEE International Conference on Computer Design, ICCD 2016, Scottsdale, AZ, USA, October 2016
@inproceedings{abc,
	author = {Kevin Hsieh and Samira Manabi Khan and Nandita Vijaykumar and Kevin K. Chang and Amirali Boroumand and Saugata Ghose and Onur Mutlu},
	booktitle = {34th IEEE International Conference on Computer Design, ICCD 2016, Scottsdale, AZ, USA},
	title = {Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation.},
	url = {http://dx.doi.org/10.1109/ICCD.2016.7753257},
	year = {2016}
}
2016 International Symposium on Rapid System Prototyping, RSP 2016, Pittsburgh, PA, USA, October 2016
@inproceedings{abc,
	author = {Onur Mutlu},
	booktitle = {2016 International Symposium on Rapid System Prototyping, RSP 2016, Pittsburgh, PA, USA},
	title = {Keynote: rethinking memory system design.},
	url = {https://doi.org/10.1145/2990299.2990300},
	year = {2016}
}
34th IEEE International Conference on Computer Design, ICCD 2016, Scottsdale, AZ, USA, October 2016
@inproceedings{abc,
	author = {Xi-Yue Xiang and Saugata Ghose and Onur Mutlu and Nian-Feng Tzeng},
	booktitle = {34th IEEE International Conference on Computer Design, ICCD 2016, Scottsdale, AZ, USA},
	title = {A model for Application Slowdown Estimation in on-chip networks and its use for improving system fairness and performance.},
	url = {http://dx.doi.org/10.1109/ICCD.2016.7753327},
	year = {2016}
}
Systems Group Master's Thesis, no. 154; Department of Computer Science, September 2016
Supervised by: Prof. Timothy Roscoe
Identifying performance bottlenecks of modern, distributed stream processing engines is a serious challenge. At the same time, such engines are widely used to perform data-parallel tasks such as machine learning, graph processing or sophisticated streaming data analysis. Oftentimes, these tasks are required to produce low-latency results as well as to achieve a high throughput. As computations often consist of complex, iterative dataflows and are distributed over multiple physical machines — with hundreds of worker threads in total — finding the source of a performance problem is a difficult task. While profiling can be used to quantify the time spent in the various steps of a parallel computation, it does not take into account the dependencies between the steps. As a result, optimization efforts are often wasted on components that have little to no influence on query latency or the overall runtime of a program. In this work, we offer a more effective alternative by applying critical path analysis, a dependency-aware technique. The critical path is defined as the longest sequence of dependent steps in a parallel program’s execution. Any increase in the execution time of a step on the critical path will therefore result in an equal increase in the total runtime of the computation. We refine existing critical path-based models and apply them to data-parallel systems, which often share common low-level principles. We provide guidelines on the instrumentation necessary to apply our model, as well as a set of trace properties that help verify the correctness of that instrumentation. Furthermore, we develop a novel method to identify phases in a worker thread’s execution during which it is waiting — e.g. for a message from a different worker—even in the absence of blocking system calls. Through critical path analysis, we can then identify performance bottlenecks in system components, dataflow operators as well as in network communication. 
To demonstrate our ideas, we implemented a prototype system capable of performing a critical path analysis of the Timely Dataflow stream processing engine. We show that our system can effectively identify the factors limiting a data-parallel computation’s overall performance. Furthermore, we demonstrate that our analysis is both efficient and scalable, and can even be performed in real-time in certain configurations.
@mastersthesis{abc,
	abstract = {Identifying performance bottlenecks of modern, distributed stream processing engines is a serious challenge. At the same time, such engines are widely used to perform data-parallel tasks such as machine learning, graph processing or sophisticated streaming data analysis. Oftentimes, these tasks are required to produce low-latency results as well as to achieve a high throughput. As computations often consist of complex, iterative dataflows and are distributed over multiple physical machines {\textemdash} with hundreds of worker threads in total {\textemdash} finding the source of a performance problem is a difficult task. While profiling can be used to quantify the time spent in the various steps of a parallel computation, it does not take into account the dependencies between the steps. As a result, optimization efforts are often wasted on components that have little to no influence on query latency or the overall runtime of a program.
In this work, we offer a more effective alternative by applying critical path analysis, a dependency-aware technique. The critical path is defined as the longest sequence of dependent steps in a parallel program{\textquoteright}s execution. Any increase in the execution time of a step on the critical path will therefore result in an equal increase in the total runtime of the computation. We refine existing critical path-based models and apply them to data-parallel systems, which often share common low-level principles. We provide guidelines on the instrumentation necessary to apply our model, as well as a set of trace properties that help verify the correctness of that instrumentation. Furthermore, we develop a novel method to identify phases in a worker thread{\textquoteright}s execution during which it is waiting {\textemdash} e.g. for a message from a different worker{\textemdash}even in the absence of blocking system calls. Through critical path analysis, we can then identify performance bottlenecks in system components, dataflow operators as well as in network communication.
To demonstrate our ideas, we implemented a prototype system capable of performing a critical path analysis of the Timely Dataflow stream processing engine. We show that our system can effectively identify the factors limiting a data-parallel computation{\textquoteright}s overall performance. Furthermore, we demonstrate that our analysis is both efficient and scalable, and can even be performed in real-time in certain configurations.},
	author = {Ralf Sager},
	school = {154},
	title = {Real-Time Performance Analysis of a Modern Data-Parallel Stream Processing Engine},
	year = {2016}
}
Systems Group Master's Thesis, no. 150; Department of Computer Science, September 2016
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Edona Meta},
	school = {150},
	title = {Question Bias in Repetitive Crowdsourcing Tasks},
	year = {2016}
}
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT 2016, Haifa, Israel, September 2016
@inproceedings{abc,
	author = {Ashutosh Pattnaik and Xulong Tang and Adwait Jog and Onur Kayiran and Asit K. Mishra and Mahmut T. Kandemir and Onur Mutlu and Chita R. Das},
	booktitle = {Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT 2016, Haifa, Israel},
	title = {Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities.},
	url = {http://doi.acm.org/10.1145/2967938.2967940},
	year = {2016}
}
Systems Group Master's Thesis, no. 151; Department of Computer Science, September 2016
Supervised by: Prof. Andreas Krause and Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Matteo Pozzetti},
	school = {151},
	title = {Access Path Design for Quality Assurance in Crowdsourcing},
	year = {2016}
}
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT 2016, Haifa, Israel, September 2016
@inproceedings{abc,
	author = {Onur Kayiran and Adwait Jog and Ashutosh Pattnaik and Rachata Ausavarungnirun and Xulong Tang and Mahmut T. Kandemir and Gabriel H. Loh and Onur Mutlu and Chita R. Das},
	booktitle = {Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT 2016, Haifa, Israel},
	title = {{$\mu$}C-States: Fine-grained GPU Datapath Power Management.},
	url = {http://doi.acm.org/10.1145/2967938.2967941},
	year = {2016}
}
Systems Group Master's Thesis, no. 152; Department of Computer Science, September 2016
Supervised by: Prof. Timothy Roscoe
@mastersthesis{abc,
	author = {Denny Lin},
	school = {152},
	title = {Blackboxing Performance Monitoring Units},
	year = {2016}
}
Systems Group Master's Thesis, no. 153; Department of Computer Science, September 2016
Supervised by: Prof. Timothy Roscoe
@mastersthesis{abc,
	author = {Andrei Parvu},
	school = {153},
	title = {Program Trace Capture and Analysis for ARM},
	year = {2016}
}
54th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2016, Monticello, IL, USA, September 2016
@inproceedings{abc,
	author = {Ioannis Mitliagkas and Ce Zhang and Stefan Hadjis and Christopher R{\'e}},
	booktitle = {54th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2016, Monticello, IL, USA},
	title = {Asynchrony begets momentum, with an application to deep learning.},
	url = {http://dx.doi.org/10.1109/ALLERTON.2016.7852343},
	year = {2016}
}
26th International Conference on Field Programmable Logic and Applications, FPL 2016, Lausanne, Switzerland, August 2016
@inproceedings{abc,
	author = {Kaan Kara and Gustavo Alonso},
	booktitle = {26th International Conference on Field Programmable Logic and Applications, FPL 2016, Lausanne, Switzerland},
	title = {Fast and robust hashing for database operators.},
	url = {http://dx.doi.org/10.1109/FPL.2016.7577353},
	year = {2016}
}
26th International Conference on Field Programmable Logic and Applications, FPL 2016, Lausanne, Switzerland, August 2016
@inproceedings{abc,
	author = {David Sidler and Zsolt Istv{\'a}n and Gustavo Alonso},
	booktitle = {26th International Conference on Field Programmable Logic and Applications, FPL 2016, Lausanne, Switzerland},
	title = {Low-latency TCP/IP stack for data center applications.},
	url = {http://dx.doi.org/10.1109/FPL.2016.7577319},
	year = {2016}
}
Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems, APSys '16, Hong Kong, China, August 2016
@inproceedings{abc,
	author = {Gerd Zellweger and Denny Lin and Timothy Roscoe},
	booktitle = {Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems, APSys {\textquoteright}16, Hong Kong, China},
	title = {So many performance events, so little time.},
	url = {http://doi.acm.org/10.1145/2967360.2967375},
	year = {2016}
}
IEEE Micro, July 2016
Network interface cards are one of the key components to achieve efficient parallel performance. In the past, they have gained new functionalities, such as lossless transmission and remote direct memory access, that are now ubiquitous in high-performance systems. Prototypes of next-generation network cards now offer new features that facilitate device programming. In this article, the authors discuss an abstract machine model for offloading architectures. They used the Portals 4 network interface to implement the proposed abstraction model, and they present two microbenchmarks to show the effects of fully offloaded collective communications. They then propose the concept of persistent offloaded operations that can reduce the creation/offloading overhead, and they discuss a possible extension to the current Portals 4 interface to enable their support. The results obtained show how this work can be used to accelerate existing MPI applications.
@article{abc,
	abstract = {Network interface cards are one of the key components to achieve efficient parallel performance. In the past, they have gained new functionalities, such as lossless transmission and remote direct memory access, that are now ubiquitous in high-performance systems. Prototypes of next-generation network cards now offer new features that facilitate device programming. In this article, the authors discuss an abstract machine model for offloading architectures. They used the Portals 4 network interface to implement the proposed abstraction model, and they present two microbenchmarks to show the effects of fully offloaded collective communications. They then propose the concept of persistent offloaded operations that can reduce the creation/offloading overhead, and they discuss a possible extension to the current Portals 4 interface to enable their support. The results obtained show how this work can be used to accelerate existing MPI applications.},
	author = {Salvatore Di Girolamo and Pierre Jolivet and Keith D. Underwood and Torsten Hoefler},
	pages = {6-17},
	journal = {IEEE Micro},
	title = {Exploiting Offload Enabled Network Interfaces},
	volume = {36},
	year = {2016}
}
Systems Group Master's Thesis, no. 149; Department of Computer Science, July 2016
Supervised by: Prof. Timothy Roscoe
Programmable networks are quickly gaining popularity in both academic research and industry, as they provide a new way to deal with network management. Several deployments by industry including Microsoft and Google already demonstrate the advantages of such paradigm shifts. Despite the initial momentum in the development of logically centralized control logic, i.e. the network controller, a shift towards proprietary industry-driven solutions is observed. Part of the problem is that open source academic-backed controllers have difficulties to scale at the level required by industry operation, particularly when it comes to the speed of operation. This thesis investigates a new approach towards the controller platform by adopting a dataflow processing framework as the computational foundation. At the same time it introduces the base of a more formal way to reason about network management. Specifically, the thesis builds around a representation of the network as a graph which allows us to specify high-level configuration policies as constraints on top of this graph and use well-understood graph computations in a data-parallel and incremental fashion to calculate the network routing. Our results demonstrate a very competitive performance of the routing module even before potential optimizations are conducted. Furthermore, interaction with the system is intuitive and human-friendly thanks to the higher-level policies we introduce.
@mastersthesis{abc,
	abstract = {Programmable networks are quickly gaining popularity in both academic research and industry, as they provide a new way to deal with network management. Several deployments by industry including Microsoft and Google already demonstrate the advantages of such paradigm shifts. Despite the initial momentum in the development of logically centralized control logic, i.e. the network controller, a shift towards proprietary industry-driven solutions is observed. Part of the problem is that open source academic-backed controllers have difficulties to scale at the level required by industry operation, particularly when it comes to the speed of operation. This thesis investigates a new approach towards the controller platform by adopting a dataflow processing framework as the computational foundation. At the same time it introduces the base of a more formal way to reason about network management. Specifically, the thesis builds around a representation of the network as a graph which allows us to specify high-level configuration policies as constraints on top of this graph and use well-understood graph computations in a data-parallel and incremental fashion to calculate the network routing. Our results demonstrate a very competitive performance of the routing module even before potential optimizations are conducted. Furthermore, interaction with the system is intuitive and human-friendly thanks to the higher-level policies we introduce.},
	author = {Christian St{\"u}ckelberger},
	school = {149},
	title = {Expressing the Routing Logic of a SDN Controller as a Differential Dataflow},
	year = {2016}
}
Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 2016
@inproceedings{abc,
	author = {Markus Pilman and Martin Kaufmann and Florian K{\"o}hl and Donald Kossmann and Damien Profeta},
	booktitle = {Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA},
	title = {ParTime: Parallel Temporal Aggregation.},
	url = {http://doi.acm.org/10.1145/2882903.2903732},
	year = {2016}
}
Proceedings of the 53rd Annual Design Automation Conference, DAC 2016, Austin, TX, USA, June 2016
@inproceedings{abc,
	author = {Wayne P. Burleson and Onur Mutlu and Mohit Tiwari},
	booktitle = {Proceedings of the 53rd Annual Design Automation Conference, DAC 2016, Austin, TX, USA},
	title = {Invited - Who is the major threat to tomorrow{\textquoteright}s security?: you, the hardware designer.},
	url = {http://doi.acm.org/10.1145/2897937.2905022},
	year = {2016}
}
Proceedings of the 12th International Workshop on Data Management on New Hardware, DaMoN 2016, San Francisco, CA, USA, June 2016
@inproceedings{abc,
	author = {Jana Giceva and Gerd Zellweger and Gustavo Alonso and Timothy Roscoe},
	booktitle = {Proceedings of the 12th International Workshop on Data Management on New Hardware, DaMoN 2016, San Francisco, CA, USA},
	title = {Customized OS support for data-processing.},
	url = {http://doi.acm.org/10.1145/2933349.2933351},
	year = {2016}
}
Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science, Antibes Juan-Les-Pins, France, June 2016
@inproceedings{abc,
	author = {Kevin K. Chang and Abhijith Kashyap and Hasan Hassan and Saugata Ghose and Kevin Hsieh and Donghyuk Lee and Tianshi Li and Gennady Pekhimenko and Samira Manabi Khan and Onur Mutlu},
	booktitle = {Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science, Antibes Juan-Les-Pins, France},
	title = {Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization.},
	url = {http://doi.acm.org/10.1145/2896377.2901453},
	year = {2016}
}
46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2016, Toulouse, France, June 2016
@inproceedings{abc,
	author = {Samira Manabi Khan and Donghyuk Lee and Onur Mutlu},
	booktitle = {46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2016, Toulouse, France},
	title = {PARBOR: An Efficient System-Level Technique to Detect Data-Dependent Failures in DRAM.},
	url = {http://dx.doi.org/10.1109/DSN.2016.30},
	year = {2016}
}
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, Kyoto, Japan, June 2016
Lossless interconnection networks are omnipresent in high performance computing systems, data centers and network-on-chip architectures. Such networks require efficient and deadlock-free routing functions to utilize the available hardware. Topology-aware routing functions become increasingly inapplicable, due to irregular topologies, which either are irregular by design or as a result of hardware failures. Existing topology-agnostic routing methods either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. We propose a novel topology-agnostic routing approach which implicitly avoids deadlocks during the path calculation instead of solving both problems separately. We present a model implementation, called Nue, of a destination-based and oblivious routing function. Nue routing heuristically optimizes the load balancing while enforcing deadlock-freedom without exceeding a given number of virtual channels, which we demonstrate based on the InfiniBand architecture.
@inproceedings{abc,
	abstract = {Lossless interconnection networks are omnipresent in high performance computing systems, data centers and network-on-chip architectures. Such networks require efficient and deadlock-free routing functions to utilize the available hardware. Topology-aware routing functions become increasingly inapplicable, due to irregular topologies, which either are irregular by design or as a result of hardware failures. Existing topology-agnostic routing methods either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. We propose a novel topology-agnostic routing approach which implicitly avoids deadlocks during the path calculation instead of solving both problems separately. We present a model implementation, called Nue, of a destination-based and oblivious routing function. Nue routing heuristically optimizes the load balancing while enforcing deadlock-freedom without exceeding a given number of virtual channels, which we demonstrate based on the InfiniBand architecture.},
	author = {Jens Domke and Torsten Hoefler and Satoshi Matsuoka},
	booktitle = {Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing},
	title = {Routing on the Dependency Graph: A New Approach to Deadlock-Free High-Performance Routing},
	venue = {Kyoto, Japan},
	year = {2016}
}
Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science, Antibes Juan-Les-Pins, France, June 2016
@inproceedings{abc,
	author = {Adwait Jog and Onur Kayiran and Ashutosh Pattnaik and Mahmut T. Kandemir and Onur Mutlu and Ravishankar Iyer and Chita R. Das},
	booktitle = {Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science, Antibes Juan-Les-Pins, France},
	title = {Exploiting Core Criticality for Enhanced GPU Performance.},
	url = {http://doi.acm.org/10.1145/2896377.2901468},
	year = {2016}
}
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, Kyoto, Japan, June 2016
We propose a topology-aware distributed Reader-Writer lock that accelerates irregular workloads for supercomputers and data centers. The core idea behind the lock is a modular design that is an interplay of three distributed data structures: a counter of readers/writers in the critical section, a set of queues for ordering writers waiting for the lock, and a tree that binds all the queues and synchronizes writers with readers. Each structure is associated with a parameter for favoring either readers or writers, enabling adjustable performance that can be viewed as a point in a three dimensional parameter space. We also develop a distributed topology-aware MCS lock that is a building block of the above design and improves state-of-the-art MPI implementations. Both schemes use non-blocking Remote Memory Access (RMA) techniques for highest performance and scalability. We evaluate our schemes on a Cray XC30 and illustrate that they outperform state-of-the-art MPI-3 RMA locking protocols by 81% and 73%, respectively. Finally, we use them to accelerate a distributed hashtable that represents irregular workloads such as key-value stores or graph processing.
@inproceedings{abc,
	abstract = {We propose a topology-aware distributed Reader-Writer lock that accelerates irregular workloads for supercomputers and data centers. The core idea behind the lock is a modular design that is an interplay of three distributed data structures: a counter of readers/writers in the critical section, a set of queues for ordering writers waiting for the lock, and a tree that binds all the queues and synchronizes writers with readers. Each structure is associated with a parameter for favoring either readers or writers, enabling adjustable performance that can be viewed as a point in a three dimensional parameter space. We also develop a distributed topology-aware MCS lock that is a building block of the above design and improves state-of-the-art MPI implementations. Both schemes use non-blocking Remote Memory Access (RMA) techniques for highest performance and scalability. We evaluate our schemes on a Cray XC30 and illustrate that they outperform state-of-the-art MPI-3 RMA locking protocols by 81\% and 73\%, respectively. Finally, we use them to accelerate a distributed hashtable that represents irregular workloads such as key-value stores or graph processing.},
	author = {Patrick Schmid and Maciej Besta and Torsten Hoefler},
	booktitle = {Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing},
	title = {High-Performance Distributed RMA Locks},
	venue = {Kyoto, Japan},
	year = {2016}
}
Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 2016
DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data - scientific papers, Web classified ads, customer service notes, and so on - were instead in a relational database, it would give analysts a massive and valuable new set of "big data." DeepDive is distinctive when compared to previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that meets that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontologists, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design is enabled by several core innovations around probabilistic training and inference.
@inproceedings{abc,
	abstract = {DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data - scientific papers, Web classified ads, customer service notes, and so on - were instead in a relational database, it would give analysts a massive and valuable new set of "big data." DeepDive is distinctive when compared to previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that meets that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontologists, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design is enabled by several core innovations around probabilistic training and inference.},
	author = {Ce Zhang and Jaeho Shin and Christopher R{\'e} and Michael J. Cafarella and Feng Niu},
	booktitle = {Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016},
	title = {Extracting Databases from Dark Data with DeepDive.},
	url = {http://doi.acm.org/10.1145/2882903.2904442},
	venue = {San Francisco, CA, USA},
	year = {2016}
}
43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 2016
@inproceedings{abc,
	author = {Kevin Hsieh and Eiman Ebrahimi and Gwangsun Kim and Niladrish Chatterjee and Mike O{\textquoteright}Connor and Nandita Vijaykumar and Onur Mutlu and Stephen W. Keckler},
	booktitle = {43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea},
	title = {Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems.},
	url = {http://doi.ieeecomputersociety.org/10.1109/ISCA.2016.27},
	year = {2016}
}
Proceedings of the 2016 International Conference on Supercomputing, Istanbul, Turkey, June 2016
Programming today's increasingly complex heterogeneous hardware is difficult, as it commonly requires the use of data-parallel languages, pragma annotations, specialized libraries, or DSL compilers. Adding explicit accelerator support into a larger code base is not only costly, but also introduces additional complexity that hinders long-term maintenance. We propose a new heterogeneous compiler that brings us closer to the dream of automatic accelerator mapping. Starting from a sequential compiler IR, we automatically generate a hybrid executable that - in combination with a new data management system - transparently offloads suitable code regions. Our approach is almost regression free for a wide range of applications while improving a range of compute kernels as well as two full SPEC CPU applications. We expect our work to reduce the initial cost of accelerator usage and to free developer time to investigate algorithmic changes.
@inproceedings{abc,
	abstract = {Programming today{\textquoteright}s increasingly complex heterogeneous hardware is difficult, as it commonly requires the use of data-parallel languages, pragma annotations, specialized libraries, or DSL compilers. Adding explicit accelerator support into a larger code base is not only costly, but also introduces additional complexity that hinders long-term maintenance. We propose a new heterogeneous compiler that brings us closer to the dream of automatic accelerator mapping. Starting from a sequential compiler IR, we automatically generate a hybrid executable that - in combination with a new data management system - transparently offloads suitable code regions. Our approach is almost regression free for a wide range of applications while improving a range of compute kernels as well as two full SPEC CPU applications. We expect our work to reduce the initial cost of accelerator usage and to free developer time to investigate algorithmic changes.},
	author = {Tobias Grosser and Torsten Hoefler},
	booktitle = {Proceedings of the 2016 International Conference on Supercomputing},
	title = {Polly-ACC: Transparent compilation to heterogeneous hardware},
	venue = {Istanbul, Turkey},
	year = {2016}
}
Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 2016
@inproceedings{abc,
	author = {Pratanu Roy and Arijit Khan and Gustavo Alonso},
	booktitle = {Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA},
	title = {Augmented Sketch: Faster and More Accurate Stream Processing.},
	url = {http://doi.acm.org/10.1145/2882903.2882948},
	year = {2016}
}
43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 2016
@inproceedings{abc,
	author = {Milad Hashemi and Khubaib and Eiman Ebrahimi and Onur Mutlu and Yale N. Patt},
	booktitle = {43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea},
	title = {Accelerating Dependent Cache Misses with an Enhanced Memory Controller.},
	url = {http://dx.doi.org/10.1109/ISCA.2016.46},
	year = {2016}
}
Proceedings of the Platform for Advanced Scientific Computing Conference, Lausanne, Switzerland, June 2016
We discuss the paper selection process of the ACM PASC16 conference. The conference spans multiple scientific fields used to very different publication cultures. We aim to combine the strengths of the conference and journal publication schemes in order to design an attractive high-quality publication venue for works in large-scale computational science. We use four non-standard key ideas to design a paper selection process for ACM PASC16: (1) no pre-selected committee, (2) short revision process, (3) full double-blindness, and (4) suggested expert reviews. In this overview, we describe our observations of the process and analyse the data in an attempt to characterize the effectiveness of the mechanisms used. We believe that the adoption of some or all of these ideas could prove beneficial beyond ACM PASC16.
@inproceedings{abc,
	abstract = {We discuss the paper selection process of the ACM PASC16 conference. The conference spans multiple scientific fields used to very different publication cultures. We aim to combine the strengths of the conference and journal publication schemes in order to design an attractive high-quality publication venue for works in large-scale computational science. We use four non-standard key ideas to design a paper selection process for ACM PASC16: (1) no pre-selected committee, (2) short revision process, (3) full double-blindness, and (4) suggested expert reviews. In this overview, we describe our observations of the process and analyse the data in an attempt to characterize the effectiveness of the mechanisms used. We believe that the adoption of some or all of these ideas could prove beneficial beyond ACM PASC16.},
	author = {Torsten Hoefler},
	booktitle = {Proceedings of the Platform for Advanced Scientific Computing Conference},
	title = {Selecting Technical Papers for an Interdisciplinary Conference: The PASC Review Process},
	venue = {Lausanne, Switzerland},
	year = {2016}
}
32nd IEEE International Conference on Data Engineering, ICDE 2016, Helsinki, Finland, May 2016
@inproceedings{abc,
	author = {Arijit Khan and Benjamin Zehnder and Donald Kossmann},
	booktitle = {32nd IEEE International Conference on Data Engineering, ICDE 2016, Helsinki, Finland},
	title = {Revenue maximization by viral marketing: A social network host{\textquoteright}s perspective.},
	url = {http://dx.doi.org/10.1109/ICDE.2016.7498227},
	year = {2016}
}
24th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2016, Washington, DC, USA, May 2016
@inproceedings{abc,
	author = {Zsolt Istv{\'a}n and David Sidler and Gustavo Alonso},
	booktitle = {24th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2016, Washington, DC, USA},
	title = {Runtime Parameterizable Regular Expression Operators for Databases.},
	url = {http://doi.ieeecomputersociety.org/10.1109/FCCM.2016.61},
	year = {2016}
}
Systems Group Master's Thesis, no. 146; Department of Computer Science, May 2016
Supervised by: Prof. Dr. Timothy Roscoe
@mastersthesis{abc,
	author = {Claudio Foellmi},
	school = {146},
	title = {OS Development in Rust},
	year = {2016}
}
ETH Zürich, Diss. Nr. 23474, May 2016
Supervised by: Prof. Dr. Timothy Roscoe
@phdthesis{abc,
	author = {Pravin Shinde},
	school = {23474},
	title = {Rethinking host network stack architecture using a dataflow modeling approach},
	year = {2016}
}
Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '16, Atlanta, GA, USA, April 2016
@inproceedings{abc,
	author = {Izzat El Hajj and Alex Merritt and Gerd Zellweger and Dejan S. Milojicic and Reto Achermann and Paolo Faraboschi and Wen-Mei W. Hwu and Timothy Roscoe and Karsten Schwan},
	booktitle = {Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS {\textquoteright}16, Atlanta, GA, USA},
	title = {SpaceJMP: Programming with Multiple Virtual Address Spaces.},
	url = {http://doi.acm.org/10.1145/2872362.2872366},
	year = {2016}
}
2016 IEEE International Conference on Cloud Engineering, IC2E 2016, Berlin, Germany, April 2016
@inproceedings{abc,
	author = {Gustavo Alonso},
	booktitle = {2016 IEEE International Conference on Cloud Engineering, IC2E 2016, Berlin, Germany},
	title = {Generalization versus Specialization in Cloud Computing Infrastructures.},
	url = {http://dx.doi.org/10.1109/IC2E.2016.50},
	year = {2016}
}
Systems Group Master's Thesis, no. 144; Department of Computer Science, April 2016
Supervised by: Prof. Timothy Roscoe
@mastersthesis{abc,
	author = {Andrea Lattuada},
	school = {144},
	title = {Programmable scheduling in a stream processing system},
	year = {2016}
}
2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 2016
@inproceedings{abc,
	author = {Gennady Pekhimenko and Evgeny Bolotin and Nandita Vijaykumar and Onur Mutlu and Todd C. Mowry and Stephen W. Keckler},
	booktitle = {2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain},
	title = {A case for toggle-aware compression for GPU systems.},
	url = {http://dx.doi.org/10.1109/HPCA.2016.7446064},
	year = {2016}
}
2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 2016
@inproceedings{abc,
	author = {Hasan Hassan and Gennady Pekhimenko and Nandita Vijaykumar and Vivek Seshadri and Donghyuk Lee and Oguz Ergin and Onur Mutlu},
	booktitle = {2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain},
	title = {ChargeCache: Reducing DRAM latency by exploiting row access locality.},
	url = {http://dx.doi.org/10.1109/HPCA.2016.7446096},
	year = {2016}
}
13th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2016, Santa Clara, CA, USA, March 2016
Consensus mechanisms for ensuring consistency are some of the most expensive operations in managing large amounts of data. Often, there is a trade-off that involves reducing the coordination overhead at the price of accepting possible data loss or inconsistencies. As the demand for more efficient data centers increases, it is important to provide better ways of ensuring consistency without affecting performance. In this paper we show that consensus (atomic broadcast) can be removed from the critical path of performance by moving it to hardware. As a proof of concept, we implement Zookeeper’s atomic broadcast at the network level using an FPGA. Our design uses both TCP and an application-specific network protocol. The design can be used to push more value into the network, e.g., by extending the functionality of middleboxes or adding inexpensive consensus to in-network processing nodes. To illustrate how this hardware consensus can be used in practical systems, we have combined it with a main-memory key-value store running on specialized microservers (built as well on FPGAs). This results in a distributed service similar to Zookeeper that exhibits high and stable performance. This work can be used as a blueprint for further specialized designs.
@inproceedings{abc,
	abstract = {Consensus mechanisms for ensuring consistency are some of the most expensive operations in managing large amounts of data. Often, there is a trade-off that involves reducing the coordination overhead at the price of accepting possible data loss or inconsistencies. As the demand for more efficient data centers increases, it is important to provide better ways of ensuring consistency without affecting performance.

In this paper we show that consensus (atomic broadcast) can be removed from the critical path of performance by moving it to hardware. As a proof of concept, we implement Zookeeper{\textquoteright}s atomic broadcast at the network level using an FPGA. Our design uses both TCP and an application-specific network protocol. The design can be used to push more value into the network, e.g., by extending the functionality of middleboxes or adding inexpensive consensus to in-network processing nodes.

To illustrate how this hardware consensus can be used in practical systems, we have combined it with a main-memory key-value store running on specialized microservers (built as well on FPGAs). This results in a distributed service similar to Zookeeper that exhibits high and stable performance. This work can be used as a blueprint for further specialized designs.},
	author = {Zsolt Istv{\'a}n and David Sidler and Gustavo Alonso and Marko Vukolic},
	booktitle = {13th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2016},
	title = {Consensus in a Box: Inexpensive Coordination in Hardware.},
	url = {https://www.usenix.org/conference/nsdi16/technical-sessions/presentation/istvan},
	venue = {Santa Clara, CA, USA},
	year = {2016}
}
2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 2016
@inproceedings{abc,
	author = {Yang Li and Di Wang and Saugata Ghose and Jie Liu and Sriram Govindan and Sean James and Eric Peterson and John Siegler and Rachata Ausavarungnirun and Onur Mutlu},
	booktitle = {2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain},
	title = {SizeCap: Efficiently handling power surges in fuel cell powered data centers.},
	url = {http://dx.doi.org/10.1109/HPCA.2016.7446085},
	year = {2016}
}
2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 2016
@inproceedings{abc,
	author = {Kevin K. Chang and Prashant J. Nair and Donghyuk Lee and Saugata Ghose and Moinuddin K. Qureshi and Onur Mutlu},
	booktitle = {2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain},
	title = {Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM.},
	url = {http://dx.doi.org/10.1109/HPCA.2016.7446095},
	year = {2016}
}
Proceedings of the 19th International Conference on Extending Database Technology, EDBT 2016, Bordeaux, France, March 15-16, 2016
@inproceedings{abc,
	author = {Gustavo Alonso},
	booktitle = {Proceedings of the 19th International Conference on Extending Database Technology, EDBT 2016, Bordeaux, France},
	title = {Data Processing in Modern Hardware.},
	url = {http://dx.doi.org/10.5441/002/edbt.2016.03},
	venue = {Bordeaux, France, March 15-16, 2016},
	year = {2016}
}
The International Journal of High Performance Computing Applications, February 2016
Relaxed synchronization offers the potential for maintaining application scalability, by allowing many processes to make independent progress when some processes suffer delays. Yet the benefits of this approach for important parallel workloads have not been investigated in detail. In this paper, we use a validated simulation approach to explore the noise-mitigation effects of idealized nonblocking collectives, in workloads where these collectives are a major contributor to total execution time. Although nonblocking collectives are unlikely to provide significant noise mitigation to applications in the low operating system noise environments expected in next-generation high-performance computing systems, we show that they can potentially improve application runtime with respect to other noise types.
@article{abc,
	abstract = {Relaxed synchronization offers the potential for maintaining application scalability, by allowing many processes to make independent progress when some processes suffer delays. Yet the benefits of this approach for important parallel workloads have not been investigated in detail. In this paper, we use a validated simulation approach to explore the noise-mitigation effects of idealized nonblocking collectives, in workloads where these collectives are a major contributor to total execution time. Although nonblocking collectives are unlikely to provide significant noise mitigation to applications in the low operating system noise environments expected in next-generation high-performance computing systems, we show that they can potentially improve application runtime with respect to other noise types.},
	author = {Patrick M. Widener and Scott Levy and Kurt B. Ferreira and Torsten Hoefler},
	pages = {121-133},
	journal = {The International Journal of High Performance Computing Applications},
	title = {On noise and the performance benefit of nonblocking collectives},
	volume = {30},
	year = {2016}
}
Parallel Computing, January 2016
@article{abc,
	author = {Rachata Ausavarungnirun and Chris Fallin and Xiangyao Yu and Kevin Kai-Wei Chang and Greg Nazario and Reetuparna Das and Gabriel H. Loh and Onur Mutlu},
	journal = {Parallel Computing},
	title = {A case for hierarchical rings with deflection routing: An energy-efficient on-chip communication substrate.},
	url = {http://dx.doi.org/10.1016/j.parco.2016.01.009},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Besmira Nushi and Ece Kamar and Eric Horvitz and Donald Kossmann},
	journal = {CoRR},
	title = {On Human Intellect and Machine Failures: Troubleshooting Integrative Machine Learning Systems.},
	url = {http://arxiv.org/abs/1611.08309},
	year = {2016}
}
ACM Trans. Comput. Syst., January 2016
@article{abc,
	author = {Simon Peter and Jialin Li and Irene Zhang and Dan R. K. Ports and Doug Woos and Arvind Krishnamurthy and Thomas E. Anderson and Timothy Roscoe},
	journal = {ACM Trans. Comput. Syst.},
	title = {Arrakis: The Operating System Is the Control Plane.},
	url = {http://doi.acm.org/10.1145/2812806},
	year = {2016}
}
Real-Time Systems, January 2016
@article{abc,
	author = {Hyoseung Kim and Dionisio de Niz and Bj{\"o}rn Andersson and Mark H. Klein and Onur Mutlu and Ragunathan Rajkumar},
	journal = {Real-Time Systems},
	title = {Bounding and reducing memory interference in COTS-based multi-core systems.},
	url = {http://dx.doi.org/10.1007/s11241-016-9248-1},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Anja Gruenheid and Donald Kossmann and Divesh Srivastava},
	journal = {CoRR},
	title = {Online Event Integration with StoryPivot.},
	url = {http://arxiv.org/abs/1610.07732},
	year = {2016}
}
CoRR, January 2016
We present ZipML, the first framework for training dense generalized linear models using end-to-end low-precision representation--in ZipML, all movements of data, including those for input samples, model, and gradients, are represented using as little as two bits per component. Within our framework, we have successfully compressed, separately, the input data by 16x, gradient by 16x, and model by 16x while still getting the same training result. Even for the most challenging datasets, we find that robust convergence can be ensured using only an end-to-end 8-bit representation or a 6-bit representation if only samples are quantized. Our work builds on previous research on using low-precision representations for gradient and model in the context of stochastic gradient descent. Our main technical contribution is a new set of techniques which allow the training samples to be processed with low precision, without affecting the convergence of the algorithm. In turn, this leads to a system where all data items move in a quantized, low precision format. In particular, we first establish that randomized rounding, while sufficient when quantizing the model and the gradients, is biased when quantizing samples, and thus leads to a different training result. We propose two new data representations which converge to the same solution as in the original data representation both in theory and empirically and require as little as 2-bits per component. As a result, if the original data is stored as 32-bit floats, we decrease the bandwidth footprint for each training iteration by up to 16x. Our results hold for models such as linear regression and least squares SVM. ZipML raises interesting theoretical questions related to the robustness of SGD to approximate data, model, and gradient representations. We conclude this working paper by a description of ongoing work extending these preliminary results.
@article{abc,
	abstract = {We present ZipML, the first framework for training dense generalized linear models using end-to-end low-precision representation--in ZipML, all movements of data, including those for input samples, model, and gradients, are represented using as little as two bits per component. Within our framework, we have successfully compressed, separately, the input data by 16x, gradient by 16x, and model by 16x while still getting the same training result. Even for the most challenging datasets, we find that robust convergence can be ensured using only an end-to-end 8-bit representation or a 6-bit representation if only samples are quantized. Our work builds on previous research on using low-precision representations for gradient and model in the context of stochastic gradient descent. Our main technical contribution is a new set of techniques which allow the training samples to be processed with low precision, without affecting the convergence of the algorithm. In turn, this leads to a system where all data items move in a quantized, low precision format. In particular, we first establish that randomized rounding, while sufficient when quantizing the model and the gradients, is biased when quantizing samples, and thus leads to a different training result. We propose two new data representations which converge to the same solution as in the original data representation both in theory and empirically and require as little as 2-bits per component. As a result, if the original data is stored as 32-bit floats, we decrease the bandwidth footprint for each training iteration by up to 16x. Our results hold for models such as linear regression and least squares SVM. ZipML raises interesting theoretical questions related to the robustness of SGD to approximate data, model, and gradient representations. We conclude this working paper by a description of ongoing work extending these preliminary results. },
	author = {Hantian Zhang and Kaan Kara and Jerry Li and Dan Alistarh and Ji Liu and Ce Zhang},
	journal = {CoRR},
	title = {ZipML: An End-to-end Bitwise Framework for Dense Generalized Linear Models.},
	url = {http://arxiv.org/abs/1611.05402},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Kevin Kai-Wei Chang and Donghyuk Lee and Zeshan Chishti and Alaa R. Alameldeen and Chris Wilkerson and Yoongu Kim and Onur Mutlu},
	journal = {CoRR},
	title = {Reducing Performance Impact of DRAM Refresh by Parallelizing Refreshes with Accesses.},
	url = {http://arxiv.org/abs/1601.06352},
	year = {2016}
}
Computer Architecture Letters, January 2016
@article{abc,
	author = {Yoongu Kim and Weikun Yang and Onur Mutlu},
	journal = {Computer Architecture Letters},
	title = {Ramulator: A Fast and Extensible DRAM Simulator.},
	url = {http://dx.doi.org/10.1109/LCA.2015.2414456},
	year = {2016}
}
IEEE Transactions on Parallel and Distributed Systems, January 2016
The increase in the number of cores per processor and the complexity of memory hierarchies make cache coherence key for programmability of current shared memory systems. However, ignoring its detailed architectural characteristics can harm performance significantly. In order to assist performance-centric programming, we propose a methodology to allow semi-automatic performance tuning with the systematic translation from an algorithm to an analytic performance model for cache line transfers. For this, we design a simple interface for cache line aware optimization, a translation methodology, and a full performance model that exposes the block-based design of caches to middleware designers. We investigate two different architectures to show the applicability of our techniques and methods: the many-core accelerator Intel Xeon Phi and a multi-core processor with a NUMA configuration (Intel Sandy Bridge). We use mathematical optimization techniques to tune synchronization algorithms to the microarchitectures, identifying three techniques to design and optimize data transfers in our model: single-use, single-step broadcast, and private cache lines.
@article{abc,
	abstract = {The increase in the number of cores per processor and the complexity of memory hierarchies make cache coherence key for programmability of current shared memory systems. However, ignoring its detailed architectural characteristics can harm performance significantly. In order to assist performance-centric programming, we propose a methodology to allow semi-automatic performance tuning with the systematic translation from an algorithm to an analytic performance model for cache line transfers. For this, we design a simple interface for cache line aware optimization, a translation methodology, and a full performance model that exposes the block-based design of caches to middleware designers. We investigate two different architectures to show the applicability of our techniques and methods: the many-core accelerator Intel Xeon Phi and a multi-core processor with a NUMA configuration (Intel Sandy Bridge). We use mathematical optimization techniques to tune synchronization algorithms to the microarchitectures, identifying three techniques to design and optimize data transfers in our model: single-use, single-step broadcast, and private cache lines.},
	author = {Sabela Ramos and Torsten Hoefler},
	pages = {2824-2837},
	journal = {IEEE Transactions on Parallel and Distributed Systems},
	title = {Cache Line Aware Algorithm Design for Cache-Coherent Architectures},
	volume = {27},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Ioannis Mitliagkas and Ce Zhang and Stefan Hadjis and Christopher R{\'e}},
	journal = {CoRR},
	title = {Asynchrony begets Momentum, with an Application to Deep Learning.},
	url = {http://arxiv.org/abs/1605.09774},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Donghyuk Lee and Yoongu Kim and Vivek Seshadri and Jamie Liu and Lavanya Subramanian and Onur Mutlu},
	journal = {CoRR},
	title = {Tiered-Latency DRAM (TL-DRAM).},
	url = {http://arxiv.org/abs/1601.06903},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Donghyuk Lee and Yoongu Kim and Gennady Pekhimenko and Samira Manabi Khan and Vivek Seshadri and Kevin Kai-Wei Chang and Onur Mutlu},
	journal = {CoRR},
	title = {Adaptive-Latency DRAM (AL-DRAM).},
	url = {http://arxiv.org/abs/1603.08454},
	year = {2016}
}
IEEE Trans. Parallel Distrib. Syst., January 2016
@article{abc,
	author = {Lavanya Subramanian and Donghyuk Lee and Vivek Seshadri and Harsha Rastogi and Onur Mutlu},
	journal = {IEEE Trans. Parallel Distrib. Syst.},
	title = {BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling.},
	url = {http://dx.doi.org/10.1109/TPDS.2016.2526003},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Xinghao Pan and Maximilian Lam and Stephen Tu and Dimitris S. Papailiopoulos and Ce Zhang and Michael I. Jordan and Kannan Ramchandran and Christopher R{\'e} and Benjamin Recht},
	journal = {CoRR},
	title = {CYCLADES: Conflict-free Asynchronous Machine Learning.},
	url = {http://arxiv.org/abs/1605.09721},
	year = {2016}
}
Commun. ACM, January 2016
@article{abc,
	author = {Daniel J. Abadi and Rakesh Agrawal and Anastasia Ailamaki and Magdalena Balazinska and Philip A. Bernstein and Michael J. Carey and Surajit Chaudhuri and Jeffrey Dean and AnHai Doan and Michael J. Franklin and Johannes Gehrke and Laura M. Haas and Alon Y. Halevy and Joseph M. Hellerstein and Yannis E. Ioannidis and H. V. Jagadish and Donald Kossmann and Samuel Madden and Sharad Mehrotra and Tova Milo and Jeffrey F. Naughton and Raghu Ramakrishnan and Volker Markl and Christopher Olston and Beng Chin Ooi and Christopher R{\'e} and Dan Suciu and Michael Stonebraker and Todd Walter and Jennifer Widom},
	journal = {Commun. ACM},
	title = {The Beckman report on database research.},
	url = {http://doi.acm.org/10.1145/2845915},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Yoongu Kim and Ross Daly and Jeremie Kim and Chris Fallin and Ji-Hye Lee and Donghyuk Lee and Chris Wilkerson and Konrad Lai and Onur Mutlu},
	journal = {CoRR},
	title = {RowHammer: Reliability Analysis and Security Implications.},
	url = {http://arxiv.org/abs/1603.00747},
	year = {2016}
}
ETH Zürich, January 2016
@inproceedings{abc,
	author = {Ioannis Mitliagkas and Ce Zhang and Stefan Hadjis and Christopher R{\'e}},
	booktitle = {ETH Z{\"u}rich},
	title = {Asynchrony begets Momentum, with an Application to Deep Learning},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Yixin Luo and Sriram Govindan and Bikash Sharma and Mark Santaniello and Justin Meza and Aman Kansal and Jie Liu and Badriddine M. Khessib and Kushagra Vaid and Onur Mutlu},
	journal = {CoRR},
	title = {Heterogeneous-Reliability Memory: Exploiting Application-Level Memory Error Tolerance.},
	url = {http://arxiv.org/abs/1602.00729},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Stefan Hadjis and Ce Zhang and Ioannis Mitliagkas and Christopher R{\'e}},
	journal = {CoRR},
	title = {Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs.},
	url = {http://arxiv.org/abs/1606.04487},
	year = {2016}
}
Bioinformatics, January 2016
@article{abc,
	author = {Emily K. Mallory and Ce Zhang and Christopher R{\'e} and Russ B. Altman},
	journal = {Bioinformatics},
	title = {Large-scale extraction of gene interactions from full-text literature using DeepDive.},
	url = {http://dx.doi.org/10.1093/bioinformatics/btv476},
	year = {2016}
}
SIGMOD Record, January 2016
The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources including emails, webpages, and pdf reports. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are key tools to attack classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.
@article{abc,
	abstract = {The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources including emails, webpages, and pdf reports. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are key tools to attack classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.},
	author = {Christopher De Sa and Alexander Ratner and Christopher R{\'e} and Jaeho Shin and Feiran Wang and Sen Wu and Ce Zhang},
	journal = {SIGMOD Record},
	title = {DeepDive: Declarative Knowledge Base Construction.},
	url = {http://doi.acm.org/10.1145/2949741.2949756},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Rachata Ausavarungnirun and Chris Fallin and Xiangyao Yu and Kevin Kai-Wei Chang and Greg Nazario and Reetuparna Das and Gabriel H. Loh and Onur Mutlu},
	journal = {CoRR},
	title = {Achieving both High Energy Efficiency and High Performance in On-Chip Communication using Hierarchical Rings with Deflection Routing.},
	url = {http://arxiv.org/abs/1602.06005},
	year = {2016}
}
PVLDB, January 2016
@article{abc,
	author = {Zaheer Chothia and John Liagouris and Frank McSherry and Timothy Roscoe},
	journal = {PVLDB},
	title = {Explaining Outputs in Modern Data Analytics.},
	url = {http://www.vldb.org/pvldb/vol9/p1137-chothia.pdf},
	year = {2016}
}
IEEE Computer, January 2016
@article{abc,
	author = {Dejan S. Milojicic and Timothy Roscoe},
	journal = {IEEE Computer},
	title = {Outlook on Operating Systems.},
	url = {http://dx.doi.org/10.1109/MC.2016.19},
	year = {2016}
}
Bioinformatics, January 2016
@article{abc,
	author = {Hongyi Xin and Sunny Nahar and Richard Zhu and John Emmons and Gennady Pekhimenko and Carl Kingsford and Can Alkan and Onur Mutlu},
	journal = {Bioinformatics},
	title = {Optimal seed solver: optimizing seed selection in read mapping.},
	url = {http://dx.doi.org/10.1093/bioinformatics/btv670},
	year = {2016}
}
CoRR, January 2016
Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. This work describes the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate "assist warps" that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.
@article{abc,
	abstract = {Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. 
This work describes the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate "assist warps" that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. 
CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. 
We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7\% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.},
	author = {Nandita Vijaykumar and Gennady Pekhimenko and Adwait Jog and Saugata Ghose and Abhishek Bhowmick and Rachata Ausavarungnirun and Chita R. Das and Mahmut T. Kandemir and Todd C. Mowry and Onur Mutlu},
	journal = {CoRR},
	title = {A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps.},
	url = {http://arxiv.org/abs/1602.01348},
	year = {2016}
}
IEEE Micro, January 2016
@article{abc,
	author = {Onur Mutlu and Richard A. Belgard and Thomas R. Gross and Norman P. Jouppi and John L. Hennessy and Steven A. Przybylski and Chris Rowen and Yale N. Patt and Wen-Mei W. Hwu and Stephen W. Melvin and Michael Shebanow and Tse-Yu Yeh and Andy Wolfe},
	journal = {IEEE Micro},
	title = {Common Bonds: MIPS, HPS, Two-Level Branch Prediction, and Compressed Code RISC Processor.},
	url = {http://dx.doi.org/10.1109/MM.2016.66},
	year = {2016}
}
TACO, January 2016
@article{abc,
	author = {Amir Yazdanbakhsh and Gennady Pekhimenko and Bradley Thwaites and Hadi Esmaeilzadeh and Onur Mutlu and Todd C. Mowry},
	journal = {TACO},
	title = {RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads.},
	url = {http://doi.acm.org/10.1145/2836168},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Kevin Kai-Wei Chang and Gabriel H. Loh and Mithuna Thottethodi and Yasuko Eckert and Mike O{\textquoteright}Connor and Srilatha Manne and Lisa Hsu and Lavanya Subramanian and Onur Mutlu},
	journal = {CoRR},
	title = {Enabling Efficient Dynamic Resizing of Large DRAM Caches via A Hardware Consistent Hashing Mechanism.},
	url = {http://arxiv.org/abs/1602.00722},
	year = {2016}
}
TACO, January 2016
@article{abc,
	author = {Donghyuk Lee and Saugata Ghose and Gennady Pekhimenko and Samira Manabi Khan and Onur Mutlu},
	journal = {TACO},
	title = {Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost.},
	url = {http://doi.acm.org/10.1145/2832911},
	year = {2016}
}
PVLDB, January 2016
@article{abc,
	author = {Darko Makreshanski and Georgios Giannikis and Gustavo Alonso and Donald Kossmann},
	journal = {PVLDB},
	title = {MQJoin: Efficient Shared Execution of Main-Memory Joins.},
	url = {http://www.vldb.org/pvldb/vol9/p480-makreshanski.pdf},
	year = {2016}
}
IEEE Journal on Selected Areas in Communications, January 2016
@article{abc,
	author = {Yixin Luo and Saugata Ghose and Yu Cai and Erich F. Haratsch and Onur Mutlu},
	journal = {IEEE Journal on Selected Areas in Communications},
	title = {Enabling Accurate and Practical Online Flash Channel Modeling for Modern MLC NAND Flash Memory.},
	url = {http://dx.doi.org/10.1109/JSAC.2016.2603608},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Mohammed Alser and Hasan Hassan and Hongyi Xin and Oguz Ergin and Onur Mutlu and Can Alkan},
	journal = {CoRR},
	title = {GateKeeper: Enabling Fast Pre-Alignment in DNA Short Read Mapping with a New Streaming Accelerator Architecture.},
	url = {http://arxiv.org/abs/1604.01789},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Vivek Seshadri and Donghyuk Lee and Thomas Mullins and Hasan Hassan and Amirali Boroumand and Jeremie Kim and Michael A. Kozuch and Onur Mutlu and Phillip B. Gibbons and Todd C. Mowry},
	journal = {CoRR},
	title = {Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM.},
	url = {http://arxiv.org/abs/1611.09988},
	year = {2016}
}
TACO, January 2016
@article{abc,
	author = {Hiroyuki Usui and Lavanya Subramanian and Kevin Kai-Wei Chang and Onur Mutlu},
	journal = {TACO},
	title = {DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators.},
	url = {http://doi.acm.org/10.1145/2847255},
	year = {2016}
}
IEEE Micro, January 2016
@article{abc,
	author = {Onur Mutlu and Richard A. Belgard and Nick Tredennick and Mike Schlansker},
	journal = {IEEE Micro},
	title = {The 2014 MICRO Test of Time Award Winners: From 1978 to 1992.},
	url = {http://dx.doi.org/10.1109/MM.2016.7},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Vivek Seshadri and Onur Mutlu},
	journal = {CoRR},
	title = {The Processing Using Memory Paradigm: In-DRAM Bulk Copy, Initialization, Bitwise AND and OR.},
	url = {http://arxiv.org/abs/1610.09603},
	year = {2016}
}
ACM Trans. Database Syst., January 2016
@article{abc,
	author = {Ce Zhang and Arun Kumar and Christopher R{\'e}},
	journal = {ACM Trans. Database Syst.},
	title = {Materialization Optimizations for Feature Selection Workloads.},
	url = {http://doi.acm.org/10.1145/2877204},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Arijit Khan and Gustavo Segovia and Donald Kossmann},
	journal = {CoRR},
	title = {Let{\textquoteright}s Do Smart Routing: For Distributed Graph Querying with Decoupled Storage.},
	url = {http://arxiv.org/abs/1611.03959},
	year = {2016}
}
IEEE Design & Test, January 2016
@article{abc,
	author = {Amir Yazdanbakhsh and Bradley Thwaites and Hadi Esmaeilzadeh and Gennady Pekhimenko and Onur Mutlu and Todd C. Mowry},
	journal = {IEEE Design \& Test},
	title = {Mitigating the Memory Bottleneck With Approximate Load Value Prediction.},
	url = {http://dx.doi.org/10.1109/MDAT.2015.2504899},
	year = {2016}
}
CoRR, January 2016
@article{abc,
	author = {Donghyuk Lee and Samira Manabi Khan and Lavanya Subramanian and Rachata Ausavarungnirun and Gennady Pekhimenko and Vivek Seshadri and Saugata Ghose and Onur Mutlu},
	journal = {CoRR},
	title = {Reducing DRAM Latency by Exploiting Design-Induced Latency Variation in Modern DRAM Chips.},
	url = {http://arxiv.org/abs/1610.09604},
	year = {2016}
}

2015

Proceedings of the 8th International Workshop on Network on Chip Architectures, NoCArc '15, Waikiki, HI, USA, December 2015
@inproceedings{abc,
	author = {Onur Mutlu},
	booktitle = {Proceedings of the 8th International Workshop on Network on Chip Architectures, NoCArc {\textquoteright}15, Waikiki, HI, USA},
	title = {Rethinking Memory System Design (along with Interconnects).},
	url = {http://doi.acm.org/10.1145/2835512.2835520},
	year = {2015}
}
2015 IEEE Global Communications Conference, GLOBECOM 2015, San Diego, CA, USA, December 2015
@inproceedings{abc,
	author = {Nugman Su and Onur Kaya and Sennur Ulukus and Mutlu Koca},
	booktitle = {2015 IEEE Global Communications Conference, GLOBECOM 2015, San Diego, CA, USA},
	title = {Cooperative Multiple Access under Energy Harvesting Constraints.},
	url = {http://dx.doi.org/10.1109/GLOCOM.2014.7417655},
	year = {2015}
}
Systems Group Master's Thesis, no. 143; Department of Computer Science, December 2015
Supervised by: Prof. Gustavo Alonso
@mastersthesis{abc,
	author = {Samuel Zehnder},
	note = {Systems Group Master's Thesis, no. 143},
	school = {Department of Computer Science},
	title = {A Scalable Distributed Locking Service using One-Sided Atomic Operations},
	year = {2015}
}
Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, Montreal, Quebec, Canada, December 2015
@inproceedings{abc,
	author = {Christopher De Sa and Ce Zhang and Kunle Olukotun and Christopher R{\'e}},
	booktitle = {Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015},
	title = {Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width.},
	url = {http://papers.nips.cc/paper/5757-rapidly-mixing-gibbs-sampling-for-a-class-of-factor-graphs-using-hierarchy-width},
	venue = {Montreal, Quebec, Canada},
	year = {2015}
}
Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, Montreal, Quebec, Canada, December 2015
@inproceedings{abc,
	author = {Christopher De Sa and Ce Zhang and Kunle Olukotun and Christopher R{\'e}},
	booktitle = {Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015},
	title = {Taming the Wild: A Unified Analysis of Hogwild-Style Algorithms.},
	url = {http://papers.nips.cc/paper/5717-taming-the-wild-a-unified-analysis-of-hogwild-style-algorithms},
	venue = {Montreal, Quebec, Canada},
	year = {2015}
}
Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA, December 2015
@inproceedings{abc,
	author = {Lavanya Subramanian and Vivek Seshadri and Arnab Ghosh and Samira Manabi Khan and Onur Mutlu},
	booktitle = {Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA},
	title = {The application slowdown model: quantifying and controlling the impact of inter-application interference at shared caches and main memory.},
	url = {http://doi.acm.org/10.1145/2830772.2830803},
	year = {2015}
}
Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA, December 2015
@inproceedings{abc,
	author = {Vivek Seshadri and Thomas Mullins and Amirali Boroumand and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA},
	title = {Gather-scatter DRAM: in-DRAM address translation to improve the spatial locality of non-unit strided accesses.},
	url = {http://doi.acm.org/10.1145/2830772.2830820},
	year = {2015}
}
Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA, December 2015
@inproceedings{abc,
	author = {Jinglei Ren and Jishen Zhao and Samira Manabi Khan and Jongmoo Choi and Yongwei Wu and Onur Mutlu},
	booktitle = {Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA},
	title = {ThyNVM: enabling software-transparent crash consistency in persistent memory systems.},
	url = {http://doi.acm.org/10.1145/2830772.2830802},
	year = {2015}
}
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015, Austin, TX, USA, November 2015
@inproceedings{abc,
	author = {Georgios Kathareios and Cyriel Minkenberg and Bogdan Prisacari and Germ{\'a}n Rodr{\'\i}guez and Torsten Hoefler},
	booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015, Austin, TX, USA},
	title = {Cost-effective diameter-two topologies: analysis and evaluation.},
	url = {http://doi.acm.org/10.1145/2807591.2807652},
	year = {2015}
}
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015, Austin, TX, USA, November 2015
@inproceedings{abc,
	author = {Torsten Hoefler and Roberto Belli},
	booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015, Austin, TX, USA},
	title = {Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results.},
	url = {http://doi.acm.org/10.1145/2807591.2807644},
	year = {2015}
}
Proceedings of the Third AAAI Conference on Human Computation and Crowdsourcing, HCOMP 2015, San Diego, California, November 2015
@inproceedings{abc,
	author = {Besmira Nushi and Adish Singla and Anja Gruenheid and Erfan Zamanian and Andreas Krause and Donald Kossmann},
	booktitle = {Proceedings of the Third AAAI Conference on Human Computation and Crowdsourcing, HCOMP 2015},
	title = {Crowd Access Path Optimization: Diversity Matters.},
	url = {http://www.aaai.org/ocs/index.php/HCOMP/HCOMP15/paper/view/11577},
	venue = {San Diego, California},
	year = {2015}
}
Systems Group Master's Thesis, no. 141; Department of Computer Science, October 2015
Supervised by: Prof. Timothy Roscoe
@mastersthesis{abc,
	author = {Martynas Pumputis},
	note = {Systems Group Master's Thesis, no. 141},
	school = {Department of Computer Science},
	title = {Message Passing for Programming Languages and Operating Systems},
	year = {2015}
}
2015 International Conference on Parallel Architecture and Compilation, PACT 2015, San Francisco, CA, USA, October 2015
@inproceedings{abc,
	author = {Donghyuk Lee and Lavanya Subramanian and Rachata Ausavarungnirun and Jongmoo Choi and Onur Mutlu},
	booktitle = {2015 International Conference on Parallel Architecture and Compilation, PACT 2015, San Francisco, CA, USA},
	title = {Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM.},
	url = {http://dx.doi.org/10.1109/PACT.2015.51},
	year = {2015}
}
2015 International Conference on Parallel Architecture and Compilation, PACT 2015, San Francisco, CA, USA, October 2015
@inproceedings{abc,
	author = {Rachata Ausavarungnirun and Saugata Ghose and Onur Kayiran and Gabriel H. Loh and Chita R. Das and Mahmut T. Kandemir and Onur Mutlu},
	booktitle = {2015 International Conference on Parallel Architecture and Compilation, PACT 2015, San Francisco, CA, USA},
	title = {Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance.},
	url = {http://dx.doi.org/10.1109/PACT.2015.38},
	year = {2015}
}
25th International Conference on Field Programmable Logic and Applications, FPL 2015, London, United Kingdom, September 2015
@inproceedings{abc,
	author = {Zsolt Istv{\'a}n and David Sidler and Gustavo Alonso},
	booktitle = {25th International Conference on Field Programmable Logic and Applications, FPL 2015, London, United Kingdom},
	title = {Building a distributed key-value store with FPGA-based microservers.},
	url = {http://dx.doi.org/10.1109/FPL.2015.7293967},
	year = {2015}
}
Systems Group Master's Thesis, no. 139; Department of Computer Science, September 2015
Supervised by: Dr. Arijit Khan
@mastersthesis{abc,
	author = {Andreas M. Nufer},
	note = {Systems Group Master's Thesis, no. 139},
	school = {Department of Computer Science},
	title = {Top-k Reliable Color Set in Uncertain Graphs},
	year = {2015}
}
Systems Group Master's Thesis, no. 136; Department of Computer Science, September 2015
Supervised by: Prof. Timothy Roscoe
@mastersthesis{abc,
	author = {Roni Haecki},
	note = {Systems Group Master's Thesis, no. 136},
	school = {Department of Computer Science},
	title = {Consensus on a multicore machine},
	year = {2015}
}
Systems Group Master's Thesis, no. 140; Department of Computer Science, September 2015
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Kevin Bocksrocker},
	note = {Systems Group Master's Thesis, no. 140},
	school = {Department of Computer Science},
	title = {Efficient Scan in Log-Structured Memory Data Stores},
	year = {2015}
}
Proceedings of the 9th International Symposium on Networks-on-Chip, NOCS 2015, Vancouver, BC, Canada, September 2015
@inproceedings{abc,
	author = {Mohammad Fattah and Antti Airola and Rachata Ausavarungnirun and Nima Mirzaei and Pasi Liljeberg and Juha Plosila and Siamak Mohammadi and Tapio Pahikkala and Onur Mutlu and Hannu Tenhunen},
	booktitle = {Proceedings of the 9th International Symposium on Networks-on-Chip, NOCS 2015, Vancouver, BC, Canada},
	title = {A Low-Overhead, Fully-Distributed, Guaranteed-Delivery Routing Algorithm for Faulty Network-on-Chips.},
	url = {http://doi.acm.org/10.1145/2786572.2786591},
	year = {2015}
}
PVLDB, August 2015
@article{abc,
	author = {Alessandra Loro and Anja Gruenheid and Donald Kossmann and Damien Profeta and Philippe Beaudequin},
	journal = {PVLDB},
	title = {Indexing and Selecting Hierarchical Business Logic.},
	url = {http://www.vldb.org/pvldb/vol8/p1656-loro.pdf},
	year = {2015}
}
PVLDB, August 2015
@article{abc,
	author = {Arijit Khan and Lei Chen},
	journal = {PVLDB},
	title = {On Uncertain Graphs Modeling and Queries.},
	url = {http://www.vldb.org/pvldb/vol8/p2042-khan.pdf},
	year = {2015}
}
ACM Trans. Program. Lang. Syst., August 2015
@article{abc,
	author = {Tobias Grosser and Sven Verdoolaege and Albert Cohen},
	journal = {ACM Trans. Program. Lang. Syst.},
	title = {Polyhedral AST Generation Is More Than Scanning Polyhedra.},
	url = {http://doi.acm.org/10.1145/2743016},
	year = {2015}
}
CoRR, August 2015
@article{abc,
	author = {Besmira Nushi and Adish Singla and Anja Gruenheid and Erfan Zamanian and Andreas Krause and Donald Kossmann},
	journal = {CoRR},
	title = {Crowd Access Path Optimization: Diversity Matters.},
	url = {http://arxiv.org/abs/1508.01951},
	year = {2015}
}
Systems Group Master's Thesis, no. 133; Department of Computer Science, August 2015
Supervised by: Prof. Gustavo Alonso
@mastersthesis{abc,
	author = {Alessandro Dovis},
	note = {Systems Group Master's Thesis, no. 133},
	school = {Department of Computer Science},
	title = {Decoupling of Data Processing Systems over RDMA/InfiniBand},
	year = {2015}
}
Systems Group Master's Thesis, no. 135; Department of Computer Science, August 2015
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Gustavo Segovia},
	note = {Systems Group Master's Thesis, no. 135},
	school = {Department of Computer Science},
	title = {Distributed Graph Querying with Decoupled Storage and Smart Routing},
	year = {2015}
}
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 2015
@inproceedings{abc,
	author = {Hastagiri P. Vanchinathan and Andreas Marfurt and Charles-Antoine Robelin and Donald Kossmann and Andreas Krause},
	booktitle = {Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia},
	title = {Discovering Valuable Items from Massive Data.},
	url = {http://doi.acm.org/10.1145/2783258.2783360},
	year = {2015}
}
2015 USENIX Annual Technical Conference, USENIX ATC '15, Santa Clara, CA, USA, July 2015
@inproceedings{abc,
	author = {Timothy L. Harris and Stefan Kaestle},
	booktitle = {2015 USENIX Annual Technical Conference, USENIX ATC {\textquoteright}15},
	title = {Callisto-RTS: Fine-Grain Parallel Loops.},
	url = {https://www.usenix.org/conference/atc15/technical-session/presentation/harris},
	venue = {Santa Clara, CA, USA},
	year = {2015}
}
2015 USENIX Annual Technical Conference, USENIX ATC '15, Santa Clara, CA, USA, July 2015
@inproceedings{abc,
	author = {Stefan Kaestle and Reto Achermann and Timothy Roscoe and Timothy L. Harris},
	booktitle = {2015 USENIX Annual Technical Conference, USENIX ATC {\textquoteright}15},
	title = {Shoal: Smart Allocation and Replication of Memory For Parallel Programs.},
	url = {https://www.usenix.org/conference/atc15/technical-session/presentation/kaestle},
	venue = {Santa Clara, CA, USA},
	year = {2015}
}
PVLDB, July 2015
@article{abc,
	author = {Darko Makreshanski and Justin J. Levandoski and Ryan Stutsman},
	journal = {PVLDB},
	title = {To Lock, Swap, or Elide: On the Interplay of Hardware Transactional Memory and Lock-Free Indexing.},
	url = {http://www.vldb.org/pvldb/vol8/p1298-makreshanski.pdf},
	year = {2015}
}
TOPC, July 2015
@article{abc,
	author = {Torsten Hoefler and James Dinan and Rajeev Thakur and Brian W. Barrett and Pavan Balaji and William Gropp and Keith D. Underwood},
	journal = {TOPC},
	title = {Remote Memory Access Programming in MPI-3.},
	url = {http://doi.acm.org/10.1145/2780584},
	year = {2015}
}
ETH Zürich, Diss. Nr. 22441, July 2015
Supervised by: Prof. Donald Kossmann
@phdthesis{abc,
	author = {Simon Loesing},
	note = {Diss. Nr. 22441},
	school = {ETH Z{\"u}rich},
	title = {Architecture for Elastic Database Services},
	year = {2015}
}
2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2015, Samos, Greece, July 2015
@inproceedings{abc,
	author = {Onur Mutlu},
	booktitle = {2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2015, Samos, Greece},
	title = {Rethinking memory system design for data-intensive computing.},
	url = {http://dx.doi.org/10.1109/SAMOS.2015.7363650},
	year = {2015}
}
45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil, June 2015
@inproceedings{abc,
	author = {Justin Meza and Qiang Wu and Sanjeev Kumar and Onur Mutlu},
	booktitle = {45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil},
	title = {Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field.},
	url = {http://dx.doi.org/10.1109/DSN.2015.57},
	year = {2015}
}
Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015, Portland, OR, USA, June 2015
@inproceedings{abc,
	author = {Sabela Ramos and Torsten Hoefler},
	booktitle = {Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015, Portland, OR, USA},
	title = {Cache Line Aware Optimizations for ccNUMA Systems.},
	url = {http://doi.acm.org/10.1145/2749246.2749256},
	year = {2015}
}
45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil, June 2015
@inproceedings{abc,
	author = {Yu Cai and Yixin Luo and Saugata Ghose and Onur Mutlu},
	booktitle = {45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil},
	title = {Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery.},
	url = {http://dx.doi.org/10.1109/DSN.2015.49},
	year = {2015}
}
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, June 2015
@inproceedings{abc,
	author = {Simon Loesing and Markus Pilman and Thomas Etter and Donald Kossmann},
	booktitle = {Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia},
	title = {On the Design and Scalability of Distributed Shared-Data Databases.},
	url = {http://doi.acm.org/10.1145/2723372.2751519},
	year = {2015}
}
Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015, Portland, OR, USA, June 2015
@inproceedings{abc,
	author = {Marius Poke and Torsten Hoefler},
	booktitle = {Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015, Portland, OR, USA},
	title = {DARE: High-Performance State Machine Replication on RDMA Networks.},
	url = {http://doi.acm.org/10.1145/2749246.2749267},
	year = {2015}
}
45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil, June 2015
@inproceedings{abc,
	author = {Moinuddin K. Qureshi and Dae-Hyun Kim and Samira Manabi Khan and Prashant J. Nair and Onur Mutlu},
	booktitle = {45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil},
	title = {AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems.},
	url = {http://dx.doi.org/10.1109/DSN.2015.58},
	year = {2015}
}
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, June 2015
@inproceedings{abc,
	author = {Claude Barthels and Simon Loesing and Gustavo Alonso and Donald Kossmann},
	booktitle = {Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia},
	title = {Rack-Scale In-Memory Join Processing using RDMA.},
	url = {http://doi.acm.org/10.1145/2723372.2750547},
	year = {2015}
}
Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 2015
Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. This paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate “assist warps” that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.
@inproceedings{abc,
	abstract = {Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. This paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate {\textquotedblleft}assist warps{\textquotedblright} that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7\% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.},
	author = {Nandita Vijaykumar and Gennady Pekhimenko and Adwait Jog and Abhishek Bhowmick and Rachata Ausavarungnirun and Chita R. Das and Mahmut T. Kandemir and Todd C. Mowry and Onur Mutlu},
	booktitle = {Proceedings of the 42nd Annual International Symposium on Computer Architecture},
	title = {A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps.},
	url = {http://doi.acm.org/10.1145/2749469.2750399},
	venue = {Portland, OR, USA},
	year = {2015}
}
CoRR, June 2015
@article{abc,
	author = {Hastagiri P. Vanchinathan and Andreas Marfurt and Charles-Antoine Robelin and Donald Kossmann and Andreas Krause},
	journal = {CoRR},
	title = {Discovering Valuable Items from Massive Data.},
	url = {http://arxiv.org/abs/1506.00935},
	year = {2015}
}
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, June 2015
@inproceedings{abc,
	author = {Lucas Braun and Thomas Etter and Georgios Gasparis and Martin Kaufmann and Donald Kossmann and Daniel Widmer and Aharon Avitzur and Anthony Iliopoulos and Eliezer Levy and Ning Liang},
	booktitle = {Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia},
	title = {Analytics in Motion: High Performance Event-Processing AND Real-Time Analytics in the Same Database.},
	url = {http://doi.acm.org/10.1145/2723372.2742783},
	year = {2015}
}
Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 2015
@inproceedings{abc,
	author = {Junwhan Ahn and Sungjoo Yoo and Onur Mutlu and Kiyoung Choi},
	booktitle = {Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA},
	title = {PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture.},
	url = {http://doi.acm.org/10.1145/2749469.2750385},
	year = {2015}
}
The 10th International Conference on Future Internet, CFI '15, Seoul, Republic of Korea, June 2015
@inproceedings{abc,
	author = {Taeho Lee and Christos Pappas and Cristina Basescu and Jun Han and Torsten Hoefler and Adrian Perrig},
	booktitle = {The 10th International Conference on Future Internet, CFI {\textquoteright}15, Seoul, Republic of Korea},
	title = {Source-Based Path Selection: The Data Plane Perspective.},
	url = {http://doi.acm.org/10.1145/2775088.2775090},
	year = {2015}
}
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, June 2015
@inproceedings{abc,
	author = {Anja Gruenheid and Donald Kossmann and Theodoros Rekatsinas and Divesh Srivastava},
	booktitle = {Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia},
	title = {StoryPivot: Comparing and Contrasting Story Evolution.},
	url = {http://doi.acm.org/10.1145/2723372.2735356},
	year = {2015}
}
Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 2015
@inproceedings{abc,
	author = {Vivek Seshadri and Gennady Pekhimenko and Olatunji Ruwase and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry and Trishul M. Chilimbi},
	booktitle = {Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA},
	title = {Page overlays: an enhanced virtual memory framework to enable fine-grained memory management.},
	url = {http://doi.acm.org/10.1145/2749469.2750379},
	year = {2015}
}
Proceedings of the 27th International Conference on Scientific and Statistical Database Management, SSDBM '15, La Jolla, CA, USA, June 2015
Bi-temporal databases support system (transaction) and application time, enabling users to query the history as recorded today and as it was known in the past. In this paper, we study windows over both system and application time, i.e., bi-temporal windows. We propose a two-dimensional index that supports one-time and continuous queries over fixed and sliding bi-temporal windows, covering static and streaming data. We demonstrate the advantages of the proposed index compared to the state-of-the-art in terms of query performance, index update overhead and space footprint.
@inproceedings{abc,
	abstract = {Bi-temporal databases support system (transaction) and application time, enabling users to query the history as recorded today and as it was known in the past. In this paper, we study windows over both system and application time, i.e., bi-temporal windows. We propose a two-dimensional index that supports one-time and continuous queries over fixed and sliding bi-temporal windows, covering static and streaming data. We demonstrate the advantages of the proposed index compared to the state-of-the-art in terms of query performance, index update overhead and space footprint.},
	author = {Chang Ge and Martin Kaufmann and Lukasz Golab and Peter M. Fischer and Anil K. Goel},
	booktitle = {Proceedings of the 27th International Conference on Scientific and Statistical Database Management, SSDBM {\textquoteright}15},
	title = {Indexing bi-temporal windows.},
	url = {http://doi.acm.org/10.1145/2791347.2791373},
	venue = {La Jolla, CA, USA},
	year = {2015}
}
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, June 2015
@inproceedings{abc,
	author = {Philip A. Bernstein and Sudipto Das and Bailu Ding and Markus Pilman},
	booktitle = {Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia},
	title = {Optimizing Optimistic Concurrency Control for Tree-Structured, Log-Structured Databases.},
	url = {http://doi.acm.org/10.1145/2723372.2737788},
	year = {2015}
}
Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 2015
@inproceedings{abc,
	author = {Junwhan Ahn and Sungpack Hong and Sungjoo Yoo and Onur Mutlu and Kiyoung Choi},
	booktitle = {Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA},
	title = {A scalable processing-in-memory accelerator for parallel graph processing.},
	url = {http://doi.acm.org/10.1145/2749469.2750386},
	year = {2015}
}
Proceedings of the 29th ACM on International Conference on Supercomputing, ICS'15, Newport Beach/Irvine, CA, USA, June 2015
@inproceedings{abc,
	author = {Tobias Gysi and Tobias Grosser and Torsten Hoefler},
	booktitle = {Proceedings of the 29th ACM on International Conference on Supercomputing, ICS{\textquoteright}15, Newport Beach/Irvine, CA, USA},
	title = {MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures.},
	url = {http://doi.acm.org/10.1145/2751205.2751223},
	year = {2015}
}
Proceedings of the 29th ACM on International Conference on Supercomputing, ICS'15, Newport Beach/Irvine, CA, USA, June 2015
@inproceedings{abc,
	author = {Maciej Besta and Torsten Hoefler},
	booktitle = {Proceedings of the 29th ACM on International Conference on Supercomputing, ICS{\textquoteright}15, Newport Beach/Irvine, CA, USA},
	title = {Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations.},
	url = {http://doi.acm.org/10.1145/2751205.2751219},
	year = {2015}
}
Proceedings of the 29th ACM on International Conference on Supercomputing, ICS'15, Newport Beach/Irvine, CA, USA, June 2015
@inproceedings{abc,
	author = {Tobias Grosser and J. Ramanujam and Louis-No{\"e}l Pouchet and P. Sadayappan and Sebastian Pop},
	booktitle = {Proceedings of the 29th ACM on International Conference on Supercomputing, ICS{\textquoteright}15, Newport Beach/Irvine, CA, USA},
	title = {Optimistic Delinearization of Parametrically Sized Arrays.},
	url = {http://doi.acm.org/10.1145/2751205.2751248},
	year = {2015}
}
Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Portland, OR, USA, June 2015
@inproceedings{abc,
	author = {Justin Meza and Qiang Wu and Sanjeev Kumar and Onur Mutlu},
	booktitle = {Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Portland, OR, USA},
	title = {A Large-Scale Study of Flash Memory Failures in the Field.},
	url = {http://doi.acm.org/10.1145/2745844.2745848},
	year = {2015}
}
Proceedings of the 29th ACM on International Conference on Supercomputing, ICS'15, Newport Beach/Irvine, CA, USA, June 2015
@inproceedings{abc,
	author = {Sergei Shudler and Alexandru Calotoiu and Torsten Hoefler and Alexandre Strube and Felix Wolf},
	booktitle = {Proceedings of the 29th ACM on International Conference on Supercomputing, ICS{\textquoteright}15, Newport Beach/Irvine, CA, USA},
	title = {Exascaling Your Library: Will Your Implementation Meet Your Expectations?},
	url = {http://doi.acm.org/10.1145/2751205.2751216},
	year = {2015}
}
Engineering the Web in the Big Data Era - 15th International Conference, ICWE 2015, Rotterdam, The Netherlands, June 2015
The online communities available on the Web have shown to be significantly interactive and capable of collectively solving difficult tasks. Nevertheless, it is still a challenge to decide how a task should be dispatched through the network due to the high diversity of the communities and the dynamically changing expertise and social availability of their members. We introduce CrowdSTAR, a framework designed to route tasks across and within online crowds. CrowdSTAR indexes the topic-specific expertise and social features of the crowd contributors and then uses a routing algorithm, which suggests the best sources to ask based on the knowledge vs. availability trade-offs. We experimented with the proposed framework for question and answering scenarios by using two popular social networks as crowd candidates: Twitter and Quora.
@inproceedings{abc,
	abstract = {The online communities available on the Web have shown to be significantly interactive and capable of collectively solving difficult tasks. Nevertheless, it is still a challenge to decide how a task should be dispatched through the network due to the high diversity of the communities and the dynamically changing expertise and social availability of their members. We introduce CrowdSTAR, a framework designed to route tasks across and within online crowds. CrowdSTAR indexes the topic-specific expertise and social features of the crowd contributors and then uses a routing algorithm, which suggests the best sources to ask based on the knowledge vs. availability trade-offs. We experimented with the proposed framework for question and answering scenarios by using two popular social networks as crowd candidates: Twitter and Quora.},
	author = {Besmira Nushi and Omar Alonso and Martin Hentschel and Vasileios Kandylas},
	booktitle = {Engineering the Web in the Big Data Era - 15th International Conference, ICWE 2015},
	title = {CrowdSTAR: A Social Task Routing Framework for Online Communities.},
	url = {http://dx.doi.org/10.1007/978-3-319-19890-3_15},
	venue = {Rotterdam, The Netherlands},
	year = {2015}
}
Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015, Portland, OR, USA, June 2015
@inproceedings{abc,
	author = {Maciej Besta and Torsten Hoefler},
	booktitle = {Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015, Portland, OR, USA},
	title = {Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages.},
	url = {http://doi.acm.org/10.1145/2749246.2749263},
	year = {2015}
}
ETH Zürich, Diss. Nr. 22450, May 2015
Supervised by: Prof. Donald Kossmann
@phdthesis{abc,
	author = {Tahmineh Sanamrad},
	note = {Diss. Nr. 22450},
	school = {ETH Z{\"u}rich},
	title = {Encrypting Databases in the Cloud: Threats and Solutions},
	year = {2015}
}
15th Workshop on Hot Topics in Operating Systems, HotOS XV, Kartause Ittingen, Switzerland, May 2015
@inproceedings{abc,
	author = {Ionel Gog and Jana Giceva and Malte Schwarzkopf and Kapil Vaswani and Dimitrios Vytiniotis and G. Ramalingam and Manuel Costa and Derek Gordon Murray and Steven Hand and Michael Isard},
	booktitle = {15th Workshop on Hot Topics in Operating Systems, HotOS XV, Kartause Ittingen, Switzerland},
	title = {Broom: Sweeping Out Garbage Collection from Big Data Systems.},
	url = {https://www.usenix.org/conference/hotos15/workshop-program/presentation/gog},
	year = {2015}
}
IEEE 31st Symposium on Mass Storage Systems and Technologies, MSST 2015, Santa Clara, CA, USA, May 2015
@inproceedings{abc,
	author = {Yixin Luo and Yu Cai and Saugata Ghose and Jongmoo Choi and Onur Mutlu},
	booktitle = {IEEE 31st Symposium on Mass Storage Systems and Technologies, MSST 2015, Santa Clara, CA, USA},
	title = {WARM: Improving NAND flash memory lifetime with write-hotness aware retention management.},
	url = {http://dx.doi.org/10.1109/MSST.2015.7208284},
	year = {2015}
}
15th Workshop on Hot Topics in Operating Systems, HotOS XV, Kartause Ittingen, Switzerland, May 2015
@inproceedings{abc,
	author = {Torsten Hoefler and Robert B. Ross and Timothy Roscoe},
	booktitle = {15th Workshop on Hot Topics in Operating Systems, HotOS XV, Kartause Ittingen, Switzerland},
	title = {Distributing the Data Plane for Remote Storage Access.},
	url = {https://www.usenix.org/conference/hotos15/workshop-program/presentation/hoefler},
	year = {2015}
}
IEEE 31st Symposium on Mass Storage Systems and Technologies, MSST 2015, Santa Clara, CA, USA, May 2015
@inproceedings{abc,
	author = {Dongwoo Kang and Seungjae Baek and Jongmoo Choi and Donghee Lee and Sam H. Noh and Onur Mutlu},
	booktitle = {IEEE 31st Symposium on Mass Storage Systems and Technologies, MSST 2015, Santa Clara, CA, USA},
	title = {Amnesic cache management for non-volatile memory.},
	url = {http://dx.doi.org/10.1109/MSST.2015.7208291},
	year = {2015}
}
15th Workshop on Hot Topics in Operating Systems, HotOS XV, Kartause Ittingen, Switzerland, May 2015
@inproceedings{abc,
	author = {Simon Gerber and Gerd Zellweger and Reto Achermann and Kornilios Kourtis and Timothy Roscoe and Dejan S. Milojicic},
	booktitle = {15th Workshop on Hot Topics in Operating Systems, HotOS XV, Kartause Ittingen, Switzerland},
	title = {Not Your Parents{\textquoteright} Physical Address Space.},
	url = {https://www.usenix.org/conference/hotos15/workshop-program/presentation/gerber},
	year = {2015}
}
Proceedings of the Fourth Workshop on Data analytics in the Cloud, DanaC 2015, Melbourne, VIC, Australia, May 2015
@inproceedings{abc,
	author = {Stefan Hadjis and Firas Abuzaid and Ce Zhang and Christopher R{\'e}},
	booktitle = {Proceedings of the Fourth Workshop on Data analytics in the Cloud, DanaC 2015, Melbourne, VIC, Australia},
	title = {Caffe con Troll: Shallow Ideas to Speed Up Deep Learning.},
	url = {http://doi.acm.org/10.1145/2799562.2799641},
	year = {2015}
}
23rd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2015, Vancouver, BC, Canada, May 2015
@inproceedings{abc,
	author = {David Sidler and Gustavo Alonso and Michaela Blott and Kimon Karras and Kees A. Vissers and Raymond Carley},
	booktitle = {23rd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2015, Vancouver, BC, Canada},
	title = {Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware.},
	url = {http://dx.doi.org/10.1109/FCCM.2015.12},
	year = {2015}
}
2015 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2015, Hyderabad, India, May 2015
@inproceedings{abc,
	author = {Roberto Belli and Torsten Hoefler},
	booktitle = {2015 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2015, Hyderabad, India},
	title = {Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization.},
	url = {http://dx.doi.org/10.1109/IPDPS.2015.30},
	year = {2015}
}
2015 IEEE International Parallel and Distributed Processing Symposium Workshop, IPDPS 2015, Hyderabad, India, May 2015
@inproceedings{abc,
	author = {Torsten Hoefler and Laxmikant V. Kale},
	booktitle = {2015 IEEE International Parallel and Distributed Processing Symposium Workshop, IPDPS 2015, Hyderabad, India},
	title = {HIPS-LSPP Keynotes.},
	url = {http://dx.doi.org/10.1109/IPDPSW.2015.173},
	year = {2015}
}
TRETS, April 2015
@article{abc,
	author = {Zsolt Istv{\'a}n and Gustavo Alonso and Michaela Blott and Kees A. Vissers},
	journal = {TRETS},
	title = {A Hash Table for Line-Rate Data Processing.},
	url = {http://doi.acm.org/10.1145/2629582},
	year = {2015}
}
31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 2015
@inproceedings{abc,
	author = {Martin Kaufmann and Peter M. Fischer and Norman May and Chang Ge and Anil K. Goel and Donald Kossmann},
	booktitle = {31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea},
	title = {Bi-temporal Timeline Index: A data structure for Processing Queries on bi-temporal data.},
	url = {http://dx.doi.org/10.1109/ICDE.2015.7113307},
	year = {2015}
}
31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 2015
@inproceedings{abc,
	author = {Arvind Arasu and Ken Eguro and Manas Joglekar and Raghav Kaushik and Donald Kossmann and Ravishankar Ramamurthy},
	booktitle = {31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea},
	title = {Transaction processing on confidential data using cipherbase.},
	url = {http://dx.doi.org/10.1109/ICDE.2015.7113304},
	year = {2015}
}
IEEE Data Eng. Bull., March 2015
@article{abc,
	author = {Tudor-Ioan Salomie and Gustavo Alonso},
	journal = {IEEE Data Eng. Bull.},
	title = {Scaling Off-the-Shelf Databases with Vela: An approach based on Virtualization and Replication.},
	url = {http://sites.computer.org/debull/A15mar/p58.pdf},
	year = {2015}
}
Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, Istanbul, Turkey, March 2015
@inproceedings{abc,
	author = {Hui Wang and Canturk Isci and Lavanya Subramanian and Jongmoo Choi and Depei Qian and Onur Mutlu},
	booktitle = {Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, Istanbul, Turkey},
	title = {A-DRM: Architecture-aware Distributed Resource Management of Virtualized Clusters.},
	url = {http://doi.acm.org/10.1145/2731186.2731202},
	year = {2015}
}
Sixteenth International Symposium on Quality Electronic Design, ISQED 2015, Santa Clara, CA, USA, March 2015
@inproceedings{abc,
	author = {Yu Cai and Ken Mai and Onur Mutlu},
	booktitle = {Sixteenth International Symposium on Quality Electronic Design, ISQED 2015, Santa Clara, CA, USA},
	title = {Comparative evaluation of FPGA and ASIC implementations of bufferless and buffered routing algorithms for on-chip networks.},
	url = {http://dx.doi.org/10.1109/ISQED.2015.7085472},
	year = {2015}
}
21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA, February 2015
@inproceedings{abc,
	author = {Gennady Pekhimenko and Tyler Huberty and Rui Cai and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA},
	title = {Exploiting compressed block size as an indicator of future reuse.},
	url = {http://dx.doi.org/10.1109/HPCA.2015.7056021},
	year = {2015}
}
21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA, February 2015
@inproceedings{abc,
	author = {Yu Cai and Yixin Luo and Erich F. Haratsch and Ken Mai and Onur Mutlu},
	booktitle = {21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA},
	title = {Data retention in MLC NAND flash memory: Characterization, optimization, and recovery.},
	url = {http://dx.doi.org/10.1109/HPCA.2015.7056062},
	year = {2015}
}
21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA, February 2015
@inproceedings{abc,
	author = {Donghyuk Lee and Yoongu Kim and Gennady Pekhimenko and Samira Manabi Khan and Vivek Seshadri and Kevin Kai-Wei Chang and Onur Mutlu},
	booktitle = {21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, Burlingame, CA, USA},
	title = {Adaptive-latency DRAM: Optimizing DRAM timing for the common-case.},
	url = {http://dx.doi.org/10.1109/HPCA.2015.7056057},
	year = {2015}
}
IJHPCA, February 2015
@article{abc,
	author = {Kamil Iskra and Torsten Hoefler},
	journal = {IJHPCA},
	title = {Operating systems and runtime environments on supercomputers.},
	url = {http://dx.doi.org/10.1177/1094342014560666},
	year = {2015}
}
TRETS, February 2015
@article{abc,
	author = {Louis Woods and Gustavo Alonso and Jens Teubner},
	journal = {TRETS},
	title = {Parallelizing Data Processing on FPGAs with Shifter Lists.},
	url = {http://doi.acm.org/10.1145/2629551},
	year = {2015}
}
Systems Group Master's Thesis, no. 128; Department of Computer Science, February 2015
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Philipp Rohr},
	note = {Systems Group Master's Thesis, no. 128},
	school = {ETH Z{\"u}rich, Department of Computer Science},
	title = {Temporal Graph Data Management for in-memory Database Systems},
	year = {2015}
}
CoRR, January 2015
@article{abc,
	author = {Lavanya Subramanian and Donghyuk Lee and Vivek Seshadri and Harsha Rastogi and Onur Mutlu},
	journal = {CoRR},
	title = {The Blacklisting Memory Scheduler: Balancing Performance, Fairness and Complexity.},
	url = {http://arxiv.org/abs/1504.00390},
	year = {2015}
}
CoRR, January 2015
@article{abc,
	author = {Christopher De Sa and Ce Zhang and Kunle Olukotun and Christopher R{\'e}},
	journal = {CoRR},
	title = {Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms.},
	url = {http://arxiv.org/abs/1506.06438},
	year = {2015}
}
Bioinformatics, January 2015
@article{abc,
	author = {Hongyi Xin and John Greth and John Emmons and Gennady Pekhimenko and Carl Kingsford and Can Alkan and Onur Mutlu},
	journal = {Bioinformatics},
	title = {Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping.},
	url = {http://dx.doi.org/10.1093/bioinformatics/btu856},
	year = {2015}
}
Computer Languages, Systems & Structures, January 2015
@article{abc,
	author = {Onur {\"U}lgen and Mutlu Avci},
	journal = {Computer Languages, Systems \& Structures},
	title = {The intelligent memory allocator selector.},
	url = {http://dx.doi.org/10.1016/j.cl.2015.09.003},
	year = {2015}
}
CoRR, January 2015
@article{abc,
	author = {Donghyuk Lee and Gennady Pekhimenko and Samira Manabi Khan and Saugata Ghose and Onur Mutlu},
	journal = {CoRR},
	title = {Simultaneous Multi Layer Access: A High Bandwidth and Low Cost 3D-Stacked Memory Interface.},
	url = {http://arxiv.org/abs/1506.03160},
	year = {2015}
}
PVLDB, January 2015
@article{abc,
	author = {Jaeho Shin and Sen Wu and Feiran Wang and Christopher De Sa and Ce Zhang and Christopher R{\'e}},
	journal = {PVLDB},
	title = {Incremental Knowledge Base Construction Using DeepDive.},
	url = {http://www.vldb.org/pvldb/vol8/p1310-shin.pdf},
	year = {2015}
}
CoRR, January 2015
@article{abc,
	author = {Christopher De Sa and Ce Zhang and Kunle Olukotun and Christopher R{\'e}},
	journal = {CoRR},
	title = {Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width.},
	url = {http://arxiv.org/abs/1510.00756},
	year = {2015}
}
CoRR, January 2015
@article{abc,
	author = {Sen Wu and Ce Zhang and Feiran Wang and Christopher R{\'e}},
	journal = {CoRR},
	title = {Incremental Knowledge Base Construction Using DeepDive.},
	url = {http://arxiv.org/abs/1502.00731},
	year = {2015}
}
CoRR, January 2015
@article{abc,
	author = {Yuke Zhu and Ce Zhang and Christopher R{\'e} and Li Fei-Fei},
	journal = {CoRR},
	title = {Building a Large-scale Multimodal Knowledge Base for Visual Question Answering.},
	url = {http://arxiv.org/abs/1507.05670},
	year = {2015}
}
CoRR, January 2015
@article{abc,
	author = {Yang Li and Jongmoo Choi and Jin Sun and Saugata Ghose and Hui Wang and Justin Meza and Jinglei Ren and Onur Mutlu},
	journal = {CoRR},
	title = {Managing Hybrid Main Memories with a Page-Utility Driven Performance Model.},
	url = {http://arxiv.org/abs/1507.03303},
	year = {2015}
}
Computer Architecture Letters, January 2015
@article{abc,
	author = {Gennady Pekhimenko and Evgeny Bolotin and Mike O{\textquoteright}Connor and Onur Mutlu and Todd C. Mowry and Stephen W. Keckler},
	journal = {Computer Architecture Letters},
	title = {Toggle-Aware Compression for GPUs.},
	url = {http://dx.doi.org/10.1109/LCA.2015.2430853},
	year = {2015}
}
Computer Architecture Letters, January 2015
@article{abc,
	author = {Vivek Seshadri and Kevin Hsieh and Amirali Boroumand and Donghyuk Lee and Michael A. Kozuch and Onur Mutlu and Phillip B. Gibbons and Todd C. Mowry},
	journal = {Computer Architecture Letters},
	title = {Fast Bulk Bitwise AND and OR in DRAM.},
	url = {http://dx.doi.org/10.1109/LCA.2015.2434872},
	year = {2015}
}
CoRR, January 2015
@article{abc,
	author = {Ankit Singla and Balakrishnan Chandrasekaran and Brighten Godfrey and Bruce M. Maggs},
	journal = {CoRR},
	title = {Towards a Speed of Light Internet.},
	url = {http://arxiv.org/abs/1505.03449},
	year = {2015}
}
CoRR, January 2015
@article{abc,
	author = {Hiroyuki Usui and Lavanya Subramanian and Kevin Kai-Wei Chang and Onur Mutlu},
	journal = {CoRR},
	title = {SQUASH: Simple QoS-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators.},
	url = {http://arxiv.org/abs/1505.07502},
	year = {2015}
}
IEEE Trans. Knowl. Data Eng., January 2015
@article{abc,
	author = {Cagri Balkesen and Jens Teubner and Gustavo Alonso and M. Tamer {\"O}zsu},
	journal = {IEEE Trans. Knowl. Data Eng.},
	title = {Main-Memory Hash Joins on Modern Processor Architectures.},
	url = {http://doi.ieeecomputersociety.org/10.1109/TKDE.2014.2313874},
	year = {2015}
}
IEEE Trans. Computers, January 2015
@article{abc,
	author = {Youyou Lu and Jiwu Shu and Jia Guo and Shuai Li and Onur Mutlu},
	journal = {IEEE Trans. Computers},
	title = {High-Performance and Lightweight Transaction Support in Flash-Based SSDs.},
	url = {http://dx.doi.org/10.1109/TC.2015.2389828},
	year = {2015}
}
IEEE Micro, January 2015
@article{abc,
	author = {Onur Mutlu and Richard A. Belgard},
	journal = {IEEE Micro},
	title = {Introducing the MICRO Test of Time Awards: Concept, Process, 2014 Winners, and the Future.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MM.2015.32},
	year = {2015}
}
CoRR, January 2015
@article{abc,
	author = {Firas Abuzaid and Stefan Hadjis and Ce Zhang and Christopher R{\'e}},
	journal = {CoRR},
	title = {Caffe con Troll: Shallow Ideas to Speed Up Deep Learning.},
	url = {http://arxiv.org/abs/1504.04343},
	year = {2015}
}
CoRR, January 2015
@article{abc,
	author = {Hongyi Xin and Richard Zhu and Sunny Nahar and John Emmons and Gennady Pekhimenko and Carl Kingsford and Can Alkan and Onur Mutlu},
	journal = {CoRR},
	title = {Optimal Seed Solver: Optimizing Seed Selection in Read Mapping.},
	url = {http://arxiv.org/abs/1506.08235},
	year = {2015}
}

2014

VLDB J., December 2014
@article{abc,
	author = {Carsten Binnig and Stefan Hildenbrand and Franz F{\"a}rber and Donald Kossmann and Juchang Lee and Norman May},
	journal = {VLDB J.},
	title = {Distributed snapshot isolation: global transactions pay globally, local transactions pay locally.},
	url = {http://dx.doi.org/10.1007/s00778-014-0359-9},
	year = {2014}
}
47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom, December 2014
@inproceedings{abc,
	author = {Jishen Zhao and Onur Mutlu and Yuan Xie},
	booktitle = {47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom},
	title = {FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems.},
	url = {http://dx.doi.org/10.1109/MICRO.2014.47},
	year = {2014}
}
47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom, December 2014
@inproceedings{abc,
	author = {Onur Kayiran and Nachiappan Chidambaram Nachiappan and Adwait Jog and Rachata Ausavarungnirun and Mahmut T. Kandemir and Gabriel H. Loh and Onur Mutlu and Chita R. Das},
	booktitle = {47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom},
	title = {Managing GPU Concurrency in Heterogeneous Architectures.},
	url = {http://dx.doi.org/10.1109/MICRO.2014.62},
	year = {2014}
}
Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, Quebec, Canada, December 2014
@inproceedings{abc,
	author = {Yingbo Zhou and Utkarsh Porwal and Ce Zhang and Hung Q. Ngo and Long Nguyen and Christopher R{\'e} and Venu Govindaraju},
	booktitle = {Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014},
	title = {Parallel Feature Selection Inspired by Group Testing.},
	url = {http://papers.nips.cc/paper/5296-parallel-feature-selection-inspired-by-group-testing},
	venue = {Montreal, Quebec, Canada},
	year = {2014}
}
PVLDB, November 2014
@article{abc,
	author = {Jana Giceva and Gustavo Alonso and Timothy Roscoe and Timothy L. Harris},
	journal = {PVLDB},
	title = {Deployment of Query Plans on Multicores.},
	url = {http://www.vldb.org/pvldb/vol8/p233-giceva.pdf},
	year = {2014}
}
ETH Zürich, Diss. Nr. 22063, November 2014
Supervised by: Prof. Gustavo Alonso
@phdthesis{abc,
	author = {Georgios Giannikis},
	note = {Diss. Nr. 22063},
	school = {ETH Z{\"u}rich},
	title = {Work Sharing Data Processing Systems},
	year = {2014}
}
PVLDB, November 2014
@article{abc,
	author = {Louis Woods and Zsolt Istv{\'a}n and Gustavo Alonso},
	journal = {PVLDB},
	title = {Ibex - An Intelligent Storage Engine with Support for Advanced SQL Off-loading.},
	url = {http://www.vldb.org/pvldb/vol7/p963-woods.pdf},
	year = {2014}
}
32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea, October 2014
@inproceedings{abc,
	author = {Lavanya Subramanian and Donghyuk Lee and Vivek Seshadri and Harsha Rastogi and Onur Mutlu},
	booktitle = {32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea},
	title = {The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost.},
	url = {http://dx.doi.org/10.1109/ICCD.2014.6974655},
	year = {2014}
}
26th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2014, Paris, France, October 2014
@inproceedings{abc,
	author = {Rachata Ausavarungnirun and Chris Fallin and Xiangyao Yu and Kevin Kai-Wei Chang and Greg Nazario and Reetuparna Das and Gabriel H. Loh and Onur Mutlu},
	booktitle = {26th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2014, Paris, France},
	title = {Design and Evaluation of Hierarchical Rings with Deflection Routing.},
	url = {http://dx.doi.org/10.1109/SBAC-PAD.2014.31},
	year = {2014}
}
11th USENIX Symposium on Operating Systems Design and Implementation, OSDI '14, Broomfield, CO, USA, October 2014
@inproceedings{abc,
	author = {Stefan C. M{\"u}ller and Gustavo Alonso and Adam Amara and Andr{\'e} Csillaghy},
	booktitle = {11th USENIX Symposium on Operating Systems Design and Implementation, OSDI {\textquoteright}14, Broomfield, CO, USA},
	title = {Pydron: Semi-Automatic Parallelization for Multi-Core and the Cloud.},
	url = {https://www.usenix.org/conference/osdi14/technical-sessions/presentation/muller},
	year = {2014}
}
2014 Conference on Timely Results in Operating Systems, TRIOS '14, Broomfield, CO, USA, October 2014
@inproceedings{abc,
	author = {Andrew Baumann and Chris Hawblitzel and Kornilios Kourtis and Timothy L. Harris and Timothy Roscoe},
	booktitle = {2014 Conference on Timely Results in Operating Systems, TRIOS {\textquoteright}14, Broomfield, CO, USA},
	title = {Cosh: Clear OS Data Sharing In An Incoherent World.},
	url = {https://www.usenix.org/conference/trios14/technical-sessions/presentation/baumann},
	year = {2014}
}
11th USENIX Symposium on Operating Systems Design and Implementation, OSDI '14, Broomfield, CO, USA, October 2014
@inproceedings{abc,
	author = {Gerd Zellweger and Simon Gerber and Kornilios Kourtis and Timothy Roscoe},
	booktitle = {11th USENIX Symposium on Operating Systems Design and Implementation, OSDI {\textquoteright}14, Broomfield, CO, USA},
	title = {Decoupling Cores, Kernels, and Operating Systems.},
	url = {https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zellweger},
	year = {2014}
}
Proceedings of the 13th ACM Workshop on Hot Topics in Networks, HotNets-XIII, Los Angeles, CA, USA, October 2014
For many Internet services, reducing latency improves the user experience and increases revenue for the service provider. While in principle latencies could nearly match the speed of light, we find that infrastructural inefficiencies and protocol overheads cause today's Internet to be much slower than this bound: typically by more than one, and often, by more than two orders of magnitude. Bridging this large gap would not only add value to today's Internet applications, but could also open the door to exciting new applications. Thus, we propose a grand challenge for the networking research community: a speed-of-light Internet. To inform this research agenda, we investigate the causes of latency inflation in the Internet across the network stack. We also discuss a few broad avenues for latency improvement.
@inproceedings{abc,
	abstract = {For many Internet services, reducing latency improves the user experience and increases revenue for the service provider. While in principle latencies could nearly match the speed of light, we find that infrastructural inefficiencies and protocol overheads cause today{\textquoteright}s Internet to be much slower than this bound: typically by more than one, and often, by more than two orders of magnitude. Bridging this large gap would not only add value to today{\textquoteright}s Internet applications, but could also open the door to exciting new applications. Thus, we propose a grand challenge for the networking research community: a speed-of-light Internet. To inform this research agenda, we investigate the causes of latency inflation in the Internet across the network stack. We also discuss a few broad avenues for latency improvement.},
	author = {Ankit Singla and Balakrishnan Chandrasekaran and Brighten Godfrey and Bruce M. Maggs},
	booktitle = {Proceedings of the 13th ACM Workshop on Hot Topics in Networks, HotNets-XIII},
	title = {The Internet at the Speed of Light.},
	url = {http://doi.acm.org/10.1145/2670518.2673876},
	venue = {Los Angeles, CA, USA},
	year = {2014}
}
11th USENIX Symposium on Operating Systems Design and Implementation, OSDI '14, Broomfield, CO, USA, October 2014
@inproceedings{abc,
	author = {Simon Peter and Jialin Li and Irene Zhang and Dan R. K. Ports and Doug Woos and Arvind Krishnamurthy and Thomas E. Anderson and Timothy Roscoe},
	booktitle = {11th USENIX Symposium on Operating Systems Design and Implementation, OSDI {\textquoteright}14, Broomfield, CO, USA},
	title = {Arrakis: The Operating System is the Control Plane.},
	url = {https://www.usenix.org/conference/osdi14/technical-sessions/presentation/peter},
	year = {2014}
}
32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea, October 2014
@inproceedings{abc,
	author = {Youyou Lu and Jiwu Shu and Long Sun and Onur Mutlu},
	booktitle = {32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea},
	title = {Loose-Ordering Consistency for persistent memory.},
	url = {http://dx.doi.org/10.1109/ICCD.2014.6974684},
	year = {2014}
}
Systems Group Master's Thesis, no. 118; Department of Computer Science, October 2014
Supervised by: Prof. Timothy Roscoe
@mastersthesis{abc,
	author = {Reto Achermann},
	note = {Systems Group Master's Thesis, no. 118},
	school = {Department of Computer Science},
	title = {Message Passing and Bulk Transport on Heterogeneous Multiprocessors},
	year = {2014}
}
32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea, October 2014
@inproceedings{abc,
	author = {Chris Fallin and Chris Wilkerson and Onur Mutlu},
	booktitle = {32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea},
	title = {The heterogeneous block architecture.},
	url = {http://dx.doi.org/10.1109/ICCD.2014.6974710},
	year = {2014}
}
Systems Group Master's Thesis, no. 115; Department of Computer Science, September 2014
Supervised by: Prof. Dr. Donald Kossmann
@mastersthesis{abc,
	author = {Moritz Hoffmann},
	note = {Systems Group Master's Thesis, no. 115},
	school = {Department of Computer Science},
	title = {Completeness is in the eye of the beholder: A sandbox concept for databases},
	year = {2014}
}
Systems Group Master's Thesis, no. 117; Department of Computer Science, September 2014
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Benjamin Zehnder},
	note = {Systems Group Master's Thesis, no. 117},
	school = {Department of Computer Science},
	title = {Towards Revenue Maximization by Viral Marketing: A Social Network Host{\textquoteright}s Perspective},
	year = {2014}
}
Proceedings of the 2nd International Workshop on In Memory Data Management and Analytics, IMDM 2014, Hangzhou, China, September 2014
@inproceedings{abc,
	author = {Victor Bittorf and Marcel Kornacker and Christopher R{\'e} and Ce Zhang},
	booktitle = {Proceedings of the 2nd International Workshop on In Memory Data Management and Analytics, IMDM 2014, Hangzhou, China},
	title = {Tradeoffs in Main-Memory Statistical Analytics from Impala to DimmWitted.},
	url = {http://www-db.in.tum.de/hosted/imdm2014/papers/bittorf.pdf},
	year = {2014}
}
Systems Group Master's Thesis, no. 119; Department of Computer Science, September 2014
Supervised by: Prof. Timothy Roscoe
@mastersthesis{abc,
	author = {Raphael Fuchs},
	note = {Systems Group Master's Thesis, no. 119},
	school = {Department of Computer Science},
	title = {Hardware Transactional Memory and Message Passing},
	year = {2014}
}
IEEE Computer, September 2014
@inproceedings{abc,
	author = {Stefan C. Muller and Gustavo Alonso and Andr{\'e} Csillaghy},
	booktitle = {IEEE Computer},
	title = {Scaling Astroinformatics: Python + Automatic Parallelization.},
	url = {http://dx.doi.org/10.1109/MC.2014.262},
	year = {2014}
}
Parallel Processing Letters, September 2014
@inproceedings{abc,
	author = {Tobias Grosser and Sven Verdoolaege and Albert Cohen and P. Sadayappan},
	booktitle = {Parallel Processing Letters},
	title = {The Relation Between Diamond Tiling and Hexagonal Tiling.},
	url = {http://dx.doi.org/10.1142/S0129626414410023},
	year = {2014}
}
International Conference on Parallel Architectures and Compilation, PACT '14, Edmonton, AB, Canada, August 2014
@inproceedings{abc,
	author = {James A. Jablin and Thomas B. Jablin and Onur Mutlu and Maurice Herlihy},
	booktitle = {International Conference on Parallel Architectures and Compilation, PACT {\textquoteright}14, Edmonton, AB, Canada},
	title = {Warp-aware trace scheduling for GPUs.},
	url = {http://doi.acm.org/10.1145/2628071.2628101},
	year = {2014}
}
International Conference on Parallel Architectures and Compilation, PACT '14, Edmonton, AB, Canada, August 2014
@inproceedings{abc,
	author = {Bradley Thwaites and Gennady Pekhimenko and Hadi Esmaeilzadeh and Amir Yazdanbakhsh and Onur Mutlu and Jongse Park and Girish Mururu and Todd C. Mowry},
	booktitle = {International Conference on Parallel Architectures and Compilation, PACT {\textquoteright}14, Edmonton, AB, Canada},
	title = {Rollback-free value prediction with approximate loads.},
	url = {http://doi.acm.org/10.1145/2628071.2628110},
	year = {2014}
}
The VLDB Journal; Nov. 2013, August 2014
@inproceedings{abc,
	author = {Philipp Unterbrunner and Gustavo Alonso and Donald Kossmann},
	booktitle = {The VLDB Journal; Nov. 2013},
	title = {High Availability, Elasticity, and Strong Consistency for Massively Parallel Scans over Relational Data.},
	year = {2014}
}
PVLDB, August 2014
@inproceedings{abc,
	author = {Arijit Khan and Sameh Elnikety},
	booktitle = {PVLDB},
	title = {Systems for Big-Graphs.},
	url = {http://www.vldb.org/pvldb/vol7/p1709-khan.pdf},
	year = {2014}
}
Data and Applications Security and Privacy XXVIII - 28th Annual IFIP WG 11.3 Working Conference, DBSec 2014, Vienna, Austria, July 2014
@inproceedings{abc,
	author = {Tahmineh Sanamrad and Lucas Braun and Donald Kossmann and Ramarathnam Venkatesan},
	booktitle = {Data and Applications Security and Privacy XXVIII - 28th Annual IFIP WG 11.3 Working Conference, DBSec 2014, Vienna, Austria},
	title = {Randomly Partitioned Encryption for Cloud Databases.},
	url = {http://dx.doi.org/10.1007/978-3-662-43936-4_20},
	year = {2014}
}
ETH Zürich, Diss. Nr. 21967, July 2014
Supervised by: Prof. Gustavo Alonso
@phdthesis{abc,
	author = {Louis Woods},
	note = {Diss. Nr. 21967},
	school = {ETH Z{\"u}rich},
	title = {FPGA-Enhanced Data Processing Systems},
	year = {2014}
}
ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '14, Austin, TX, June 2014
@inproceedings{abc,
	author = {Yu Cai and Gulay Yalcin and Onur Mutlu and Erich F. Haratsch and Osman S. Unsal and Adri{\'a}n Cristal and Ken Mai},
	booktitle = {ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS {\textquoteright}14, Austin, TX},
	title = {Neighbor-cell assisted error correction for MLC NAND flash memories.},
	url = {http://doi.acm.org/10.1145/2591971.2591994},
	year = {2014}
}
International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 2014
@inproceedings{abc,
	author = {Arijit Khan and Pouya Yanki and Bojana Dimcheva and Donald Kossmann},
	booktitle = {International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA},
	title = {Towards indexing functions: answering scalar product queries.},
	url = {http://doi.acm.org/10.1145/2588555.2610493},
	year = {2014}
}
ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '14, Austin, TX, June 2014
@inproceedings{abc,
	author = {Sangeetha Abdu Jyothi and Ankit Singla and Brighten Godfrey and Alexandra Kolla},
	booktitle = {ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS {\textquoteright}14, Austin, TX},
	title = {Measuring throughput of data center network topologies.},
	url = {http://doi.acm.org/10.1145/2591971.2592040},
	year = {2014}
}
Second International Workshop on Graph Data Management Experiences and Systems, GRADES 2014, co-located with SIGMOD/PODS 2014, Snowbird, Utah, USA, June 2014
@inproceedings{abc,
	author = {Nandish Jayaram and Arijit Khan and Chengkai Li and Xifeng Yan and Ramez Elmasri},
	booktitle = {Second International Workshop on Graph Data Management Experiences and Systems, GRADES 2014, co-located with SIGMOD/PODS 2014, Snowbird, Utah, USA},
	title = {Towards a Query-by-Example System for Knowledge Graphs.},
	url = {http://event.cwi.nl/grades2014/11-jayaram.pdf},
	year = {2014}
}
44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2014, Atlanta, GA, USA, June 2014
@inproceedings{abc,
	author = {Yixin Luo and Sriram Govindan and Bikash Sharma and Mark Santaniello and Justin Meza and Aman Kansal and Jie Liu and Badriddine M. Khessib and Kushagra Vaid and Onur Mutlu},
	booktitle = {44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2014, Atlanta, GA, USA},
	title = {Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory.},
	url = {http://dx.doi.org/10.1109/DSN.2014.50},
	year = {2014}
}
Asia-Pacific Workshop on Systems, APSys'14, Beijing, China, June 2014
@inproceedings{abc,
	author = {Zaheer Chothia and Qin Yin and Timothy Roscoe},
	booktitle = {Asia-Pacific Workshop on Systems, APSys{\textquoteright}14, Beijing, China},
	title = {Grok the data center.},
	url = {http://doi.acm.org/10.1145/2637166.2637234},
	year = {2014}
}
ETH Zürich, Diss. Nr. 21964, June 2014
Supervised by: Prof. Donald Kossmann
@phdthesis{abc,
	author = {Martin Kaufmann},
	note = {Diss. Nr. 21964},
	school = {ETH Z{\"u}rich},
	title = {Storing and Processing Temporal Data in Main Memory Column Stores},
	year = {2014}
}
International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 2014
@inproceedings{abc,
	author = {Ce Zhang and Arun Kumar and Christopher R{\'e}},
	booktitle = {International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA},
	title = {Materialization optimizations for feature selection workloads.},
	url = {http://doi.acm.org/10.1145/2588555.2593678},
	year = {2014}
}
ACM/IEEE 41st International Symposium on Computer Architecture, ISCA 2014, Minneapolis, MN, USA, June 2014
@inproceedings{abc,
	author = {Yoongu Kim and Ross Daly and Jeremie Kim and Chris Fallin and Ji-Hye Lee and Donghyuk Lee and Chris Wilkerson and Konrad Lai and Onur Mutlu},
	booktitle = {ACM/IEEE 41st International Symposium on Computer Architecture, ISCA 2014, Minneapolis, MN, USA},
	title = {Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors.},
	url = {http://dx.doi.org/10.1109/ISCA.2014.6853210},
	year = {2014}
}
ACM/IEEE 41st International Symposium on Computer Architecture, ISCA 2014, Minneapolis, MN, USA, June 2014
@inproceedings{abc,
	author = {Vivek Seshadri and Abhishek Bhowmick and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {ACM/IEEE 41st International Symposium on Computer Architecture, ISCA 2014, Minneapolis, MN, USA},
	title = {The Dirty-Block Index.},
	url = {http://dx.doi.org/10.1109/ISCA.2014.6853204},
	year = {2014}
}
ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '14, Austin, TX, June 2014
@inproceedings{abc,
	author = {Samira Manabi Khan and Donghyuk Lee and Yoongu Kim and Alaa R. Alameldeen and Chris Wilkerson and Onur Mutlu},
	booktitle = {ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS {\textquoteright}14, Austin, TX},
	title = {The efficacy of error mitigation techniques for DRAM retention failures: a comparative experimental study.},
	url = {http://doi.acm.org/10.1145/2591971.2592000},
	year = {2014}
}
International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 2014
@inproceedings{abc,
	author = {Zsolt Istv{\'a}n and Louis Woods and Gustavo Alonso},
	booktitle = {International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA},
	title = {Histograms as a side effect of data movement for big data.},
	url = {http://doi.acm.org/10.1145/2588555.2612174},
	year = {2014}
}
PVLDB, May 2014
@inproceedings{abc,
	author = {Pratanu Roy and Jens Teubner and Rainer Gemulla},
	booktitle = {PVLDB},
	title = {Low-Latency Handshake Join.},
	url = {http://www.vldb.org/pvldb/vol7/p709-roy.pdf},
	year = {2014}
}
PVLDB, May 2014
@inproceedings{abc,
	author = {Anja Gruenheid and Xin Dong and Divesh Srivastava},
	booktitle = {PVLDB},
	title = {Incremental Record Linkage.},
	url = {http://www.vldb.org/pvldb/vol7/p697-gruenheid.pdf},
	year = {2014}
}
Systems Group Master's Thesis, no. 112; Department of Computer Science, May 2014
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Bojana Dimcheva},
	note = {Systems Group Master's Thesis, no. 112},
	school = {Department of Computer Science},
	title = {Indexing Scalar Product Queries},
	year = {2014}
}
ETH Zürich, Diss. Nr. 21954, May 2014
Supervised by: Prof. Gustavo Alonso
@phdthesis{abc,
	author = {Cagri Balkesen},
	note = {Diss. Nr. 21954},
	school = {ETH Z{\"u}rich},
	title = {In-Memory Parallel Join Processing on Multi-Core Processors},
	year = {2014}
}
20th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS 2014, Berlin, Germany, April 2014
@inproceedings{abc,
	author = {Hyoseung Kim and Dionisio de Niz and Bj{\"o}rn Andersson and Mark H. Klein and Onur Mutlu and Ragunathan Rajkumar},
	booktitle = {20th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS 2014, Berlin, Germany},
	title = {Bounding memory interference delay in COTS-based multi-core systems.},
	url = {http://dx.doi.org/10.1109/RTAS.2014.6925998},
	year = {2014}
}
Workshops Proceedings of the 30th International Conference on Data Engineering Workshops, ICDE 2014, Chicago, IL, USA, April 2014
@inproceedings{abc,
	author = {Christian Tinnefeld and Donald Kossmann and Joos-Hendrik B{\"o}se and Hasso Plattner},
	booktitle = {Workshops Proceedings of the 30th International Conference on Data Engineering Workshops, ICDE 2014, Chicago, IL, USA},
	title = {Parallel join executions in RAMCloud.},
	url = {http://dx.doi.org/10.1109/ICDEW.2014.6818325},
	year = {2014}
}
Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2014, Seattle, WA, USA, April 2014
@inproceedings{abc,
	author = {Ankit Singla and Brighten Godfrey and Alexandra Kolla},
	booktitle = {Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2014, Seattle, WA, USA},
	title = {High Throughput Data Center Topology Design.},
	url = {https://www.usenix.org/conference/nsdi14/technical-sessions/presentation/singla},
	year = {2014}
}
2014 IEEE Conference on Computer Communications, INFOCOM 2014, Toronto, Canada, April 2014
@inproceedings{abc,
	author = {Brent Stephens and Alan L. Cox and Ankit Singla and John B. Carter and Colin Dixon and Wes Felter},
	booktitle = {2014 IEEE Conference on Computer Communications, INFOCOM 2014, Toronto, Canada},
	title = {Practical DCB for improved data center networks.},
	url = {http://dx.doi.org/10.1109/INFOCOM.2014.6848121},
	year = {2014}
}
EDBT: 17th International Conference on Extending Database Technology, March 2014
After more than a decade of a virtual standstill, the adoption of temporal data management features has recently picked up speed, driven by customer demand and the inclusion of temporal expressions into SQL:2011. Most of the big commercial DBMS now include support for bitemporal data and operators. In this paper, we perform a thorough analysis of these commercial temporal DBMS: We investigate their architecture, determine their performance and study the impact of performance tuning. This analysis utilizes our recent (TPCTC 2013) benchmark proposal, which includes a comprehensive temporal workload definition. The results of our analysis show that the support for temporal data is still in its infancy: All systems store their data in regular, statically partitioned tables and rely on standard indexes as well as query rewrites for their operations. As shown by our measurements, this causes considerable performance variations on slight workload variations and a significant effort for performance tuning. In some cases, there is considerable overhead for temporal operations even after extensive tuning.
@inproceedings{abc,
	abstract = {After more than a decade of a virtual standstill, the adoption of temporal data management features has recently picked up speed, driven by customer demand and the inclusion of temporal expressions into SQL:2011. Most of the big commercial DBMS now include support for bitemporal data and operators. 
In this paper, we perform a thorough analysis of these commercial temporal DBMS: We investigate their architecture, determine their performance and study the impact of performance tuning. This analysis utilizes our recent (TPCTC 2013) benchmark proposal, which includes a comprehensive temporal workload definition.
The results of our analysis show that the support for temporal data is still in its infancy:  All systems store their data in regular, statically partitioned tables and rely on standard indexes as well as query rewrites for their operations. As shown by our measurements, this causes considerable performance variations on slight workload variations and a significant effort for performance tuning. In some cases, there is considerable overhead for temporal operations even after extensive tuning.},
	author = {Martin Kaufmann and Peter M. Fischer and Norman May and Donald Kossmann},
	booktitle = {EDBT: 17th International Conference on Extending Database Technology},
	title = {Benchmarking Bitemporal Database Systems: Ready for the Future or Stuck in the Past?},
	year = {2014}
}
Proceedings of the 17th International Conference on Extending Database Technology, EDBT 2014, Athens, Greece, March 2014
@inproceedings{abc,
	author = {Arijit Khan and Francesco Bonchi and Aristides Gionis and Francesco Gullo},
	booktitle = {Proceedings of the 17th International Conference on Extending Database Technology, EDBT 2014, Athens, Greece},
	title = {Fast Reliability Search in Uncertain Graphs.},
	url = {http://dx.doi.org/10.5441/002/edbt.2014.48},
	year = {2014}
}
Proceedings of the 17th International Conference on Extending Database Technology, EDBT 2014, Athens, Greece, March 2014
@inproceedings{abc,
	author = {Martin Kaufmann and Peter M. Fischer and Norman May and Donald Kossmann},
	booktitle = {Proceedings of the 17th International Conference on Extending Database Technology, EDBT 2014, Athens, Greece},
	title = {Benchmarking Bitemporal Database Systems: Ready for the Future or Stuck in the Past?},
	url = {http://dx.doi.org/10.5441/002/edbt.2014.80},
	year = {2014}
}
Systems Group Master's Thesis, no. 104; Department of Computer Science, February 2014
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Alessandra Loro},
	note = {Systems Group Master's Thesis, no. 104},
	school = {Department of Computer Science},
	title = {Business Rules Retrieval and Processing},
	year = {2014}
}
ETH Zürich, Diss. Nr. 21762, February 2014
Supervised by: Prof. Timothy Roscoe
@phdthesis{abc,
	author = {Ercan Ucan},
	note = {Diss. Nr. 21762},
	school = {ETH Z{\"u}rich},
	title = {Data storage, transfers and communication in personal clouds},
	year = {2014}
}
20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014, Orlando, FL, USA, February 2014
@inproceedings{abc,
	author = {Kevin Kai-Wei Chang and Donghyuk Lee and Zeshan Chishti and Alaa R. Alameldeen and Chris Wilkerson and Yoongu Kim and Onur Mutlu},
	booktitle = {20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014, Orlando, FL, USA},
	title = {Improving DRAM performance by parallelizing refreshes with accesses.},
	url = {http://dx.doi.org/10.1109/HPCA.2014.6835946},
	year = {2014}
}
20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014, Orlando, FL, USA, February 2014
@inproceedings{abc,
	author = {Samira Manabi Khan and Alaa R. Alameldeen and Chris Wilkerson and Onur Mutlu and Daniel A. Jim{\'e}nez},
	booktitle = {20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014, Orlando, FL, USA},
	title = {Improving cache performance using read-write partitioning.},
	url = {http://dx.doi.org/10.1109/HPCA.2014.6835954},
	year = {2014}
}
PVLDB, February 2014
@inproceedings{abc,
	author = {Georgios Giannikis and Darko Makreshanski and Gustavo Alonso and Donald Kossmann},
	booktitle = {PVLDB},
	title = {Shared Workload Optimization.},
	url = {http://www.vldb.org/pvldb/vol7/p429-giannikis.pdf},
	year = {2014}
}
12th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2014, Orlando, FL, USA, February 2014
@inproceedings{abc,
	author = {Tobias Grosser and Albert Cohen and Justin Holewinski and P. Sadayappan and Sven Verdoolaege},
	booktitle = {12th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2014, Orlando, FL, USA},
	title = {Hybrid Hexagonal/Classical Tiling for GPUs.},
	url = {http://doi.acm.org/10.1145/2544137.2544160},
	year = {2014}
}
PVLDB, January 2014
@inproceedings{abc,
	author = {Ce Zhang and Christopher R{\'e}},
	booktitle = {PVLDB},
	title = {DimmWitted: A Study of Main-Memory Statistical Analytics.},
	url = {http://www.vldb.org/pvldb/vol7/p1283-zhang.pdf},
	year = {2014}
}
TACO, January 2014
@inproceedings{abc,
	author = {HanBin Yoon and Justin Meza and Naveen Muralimanohar and Norman P. Jouppi and Onur Mutlu},
	booktitle = {TACO},
	title = {Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories.},
	url = {http://doi.acm.org/10.1145/2669365},
	year = {2014}
}
CoRR, January 2014
@article{abc,
	author = {Shanan Peters and Ce Zhang and Miron Livny and Christopher R{\'e}},
	journal = {CoRR},
	title = {A machine-compiled macroevolutionary history of Phanerozoic life.},
	url = {http://arxiv.org/abs/1406.2963},
	year = {2014}
}
TACO, January 2014
@inproceedings{abc,
	author = {Vivek Seshadri and Samihan Yedkar and Hongyi Xin and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {TACO},
	title = {Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks.},
	url = {http://doi.acm.org/10.1145/2677956},
	year = {2014}
}
IEEE Data Eng. Bull., January 2014
@inproceedings{abc,
	author = {Christopher R{\'e} and Amir Abbas Sadeghian and Zifei Shan and Jaeho Shin and Feiran Wang and Sen Wu and Ce Zhang},
	booktitle = {IEEE Data Eng. Bull.},
	title = {Feature Engineering for Knowledge Base Construction.},
	url = {http://sites.computer.org/debull/A14sept/p26.pdf},
	year = {2014}
}
ETH Zürich, Diss. Nr. 21753, January 2014
Supervised by: Prof. Gustavo Alonso
@phdthesis{abc,
	author = {Tudor-Ioan Salomie},
	note = {Diss. Nr. 21753},
	school = {ETH Z{\"u}rich},
	title = {Cloud-ready scalable and elastic data processing using off-the-shelf databases, replication and virtualization},
	year = {2014}
}
CoRR, January 2014
@article{abc,
	author = {Ce Zhang and Christopher R{\'e}},
	journal = {CoRR},
	title = {DimmWitted: A Study of Main-Memory Statistical Analytics.},
	url = {http://arxiv.org/abs/1403.7550},
	year = {2014}
}
CoRR, January 2014
@article{abc,
	author = {Ce Zhang and Christopher R{\'e} and Amir Abbas Sadeghian and Zifei Shan and Jaeho Shin and Feiran Wang and Sen Wu},
	journal = {CoRR},
	title = {Feature Engineering for Knowledge Base Construction.},
	url = {http://arxiv.org/abs/1407.6439},
	year = {2014}
}
CoRR, January 2014
The online communities available on the Web have shown to be significantly interactive and capable of collectively solving difficult tasks. Nevertheless, it is still a challenge to decide how a task should be dispatched through the network due to the high diversity of the communities and the dynamically changing expertise and social availability of their members. We introduce CrowdSTAR, a framework designed to route tasks across and within online crowds. CrowdSTAR indexes the topic-specific expertise and social features of the crowd contributors and then uses a routing algorithm, which suggests the best sources to ask based on the knowledge vs. availability trade-offs. We experimented with the proposed framework for question and answering scenarios by using two popular social networks as crowd candidates: Twitter and Quora.
@inproceedings{abc,
	abstract = {The online communities available on the Web have shown to be significantly interactive and capable of collectively solving difficult tasks. Nevertheless, it is still a challenge to decide how a task should be dispatched through the network due to the high diversity of the communities and the dynamically changing expertise and social availability of their members. We introduce CrowdSTAR, a framework designed to route tasks across and within online crowds. CrowdSTAR indexes the topic-specific expertise and social features of the crowd contributors and then uses a routing algorithm, which suggests the best sources to ask based on the knowledge vs. availability trade-offs. We experimented with the proposed framework for question and answering scenarios by using two popular social networks as crowd candidates: Twitter and Quora.},
	author = {Besmira Nushi and Omar Alonso and Martin Hentschel and Vasileios Kandylas},
	booktitle = {CoRR},
	title = {CrowdSTAR: A Social Task Routing Framework for Online Communities.},
	url = {http://arxiv.org/abs/1407.6714},
	year = {2014}
}

2013

The 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, Davis, CA, USA, December 2013
@inproceedings{abc,
	author = {Vivek Seshadri and Yoongu Kim and Chris Fallin and Donghyuk Lee and Rachata Ausavarungnirun and Gennady Pekhimenko and Yixin Luo and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {The 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, Davis, CA, USA},
	title = {RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization.},
	url = {http://doi.acm.org/10.1145/2540708.2540725},
	year = {2013}
}
The 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, Davis, CA, USA, December 2013
@inproceedings{abc,
	author = {Gennady Pekhimenko and Vivek Seshadri and Yoongu Kim and Hongyi Xin and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {The 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, Davis, CA, USA},
	title = {Linearly compressed pages: a low-complexity, low-latency main memory compression framework.},
	url = {http://doi.acm.org/10.1145/2540708.2540724},
	year = {2013}
}
Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, December 2013
@inproceedings{abc,
	author = {Srikrishna Sridhar and Stephen J. Wright and Christopher R{\'e} and Ji Liu and Victor Bittorf and Ce Zhang},
	booktitle = {Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States.},
	title = {An Approximate, Efficient LP Solver for LP Rounding.},
	url = {http://papers.nips.cc/paper/4990-an-approximate-efficient-lp-solver-for-lp-rounding},
	venue = {Lake Tahoe, Nevada, United States.},
	year = {2013}
}
Proceedings of The Twenty-Second Text REtrieval Conference, TREC 2013, Gaithersburg, Maryland, USA, November 2013
@inproceedings{abc,
	author = {John R. Frank and Steven J. Bauer and Max Kleiman-Weiner and Daniel A. Roberts and Nilesh Tripuraneni and Ce Zhang and Christopher R{\'e} and Ellen M. Voorhees and Ian Soboroff},
	booktitle = {Proceedings of The Twenty-Second Text REtrieval Conference, TREC 2013, Gaithersburg, Maryland, USA},
	title = {Evaluating Stream Filtering for Entity Profile Updates for TREC 2013.},
	url = {http://trec.nist.gov/pubs/trec22/papers/KBA.OVERVIEW.pdf},
	year = {2013}
}
Proceedings of The Twenty-Second Text REtrieval Conference, TREC 2013, Gaithersburg, Maryland, USA, November 2013
@inproceedings{abc,
	author = {Tushar Khot and Ce Zhang and Jude W. Shavlik and Sriraam Natarajan and Christopher R{\'e}},
	booktitle = {Proceedings of The Twenty-Second Text REtrieval Conference, TREC 2013, Gaithersburg, Maryland, USA},
	title = {Bootstrapping Knowledge Base Acceleration.},
	url = {http://trec.nist.gov/pubs/trec22/papers/wisc-kba.pdf},
	year = {2013}
}
Proceedings of the Seventh Workshop on Programming Languages and Operating Systems, PLOS 2013, Farmington, Pennsylvania, USA, November 2013
@inproceedings{abc,
	author = {Pravin Shinde and Antoine Kaufmann and Kornilios Kourtis and Timothy Roscoe},
	booktitle = {Proceedings of the Seventh Workshop on Programming Languages and Operating Systems, PLOS 2013, Farmington, Pennsylvania, USA},
	title = {Modeling NICs with Unicorn.},
	url = {http://doi.acm.org/10.1145/2525528.2525532},
	year = {2013}
}
2013 IEEE 31st International Conference on Computer Design, ICCD 2013, Asheville, NC, USA, October 2013
@inproceedings{abc,
	author = {Youyou Lu and Jiwu Shu and Jia Guo and Shuai Li and Onur Mutlu},
	booktitle = {2013 IEEE 31st International Conference on Computer Design, ICCD 2013, Asheville, NC, USA},
	title = {LightTx: A lightweight transactional design in flash-based SSDs to support flexible transactions.},
	url = {http://dx.doi.org/10.1109/ICCD.2013.6657033},
	year = {2013}
}
2013 IEEE 31st International Conference on Computer Design, ICCD 2013, Asheville, NC, USA, October 2013
@inproceedings{abc,
	author = {Yu Cai and Onur Mutlu and Erich F. Haratsch and Ken Mai},
	booktitle = {2013 IEEE 31st International Conference on Computer Design, ICCD 2013, Asheville, NC, USA},
	title = {Program interference in MLC NAND flash memory: Characterization, modeling, and mitigation.},
	url = {http://dx.doi.org/10.1109/ICCD.2013.6657034},
	year = {2013}
}
Systems Group Master's Thesis, no. 101; Department of Computer Science, October 2013
Supervised by: Prof. Gustavo Alonso
@mastersthesis{abc,
	author = {Denitsa Dobreva},
	note = {Systems Group Master's Thesis, no. 101},
	school = {Department of Computer Science},
	title = {Enterprise Social Networks Analysis},
	year = {2013}
}
Systems Group Master's Thesis, no. 94; Department of Computer Science, September 2013
Supervised by: Prof. Gustavo Alonso
@mastersthesis{abc,
	author = {Patrick B{\"a}nziger},
	note = {Systems Group Master's Thesis, no. 94},
	school = {Department of Computer Science},
	title = {Exploiting multi-core parallelism with pipelining to solve skyline queries},
	year = {2013}
}
Systems Group Master's Thesis, no. 93; Department of Computer Science, September 2013
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Lynn Aders},
	note = {Systems Group Master's Thesis, no. 93},
	school = {Department of Computer Science},
	title = {Joins based on the Access Path Model for Crowdsourced Databases},
	year = {2013}
}
Systems Group Master's Thesis, no. 90; Department of Computer Science, September 2013
@mastersthesis{abc,
	author = {Daniel Widmer},
	note = {Systems Group Master's Thesis, no. 90},
	school = {Department of Computer Science},
	title = {Real-Time Analytics in a High Volume Event Processing System},
	year = {2013}
}
23rd International Conference on Field programmable Logic and Applications, FPL 2013, Porto, Portugal, September 2013
@inproceedings{abc,
	author = {Arvind Arasu and Ken Eguro and Raghav Kaushik and Donald Kossmann and Ravishankar Ramamurthy and Ramarathnam Venkatesan},
	booktitle = {23rd International Conference on Field programmable Logic and Applications, FPL 2013, Porto, Portugal},
	title = {A secure coprocessor for database applications.},
	url = {http://dx.doi.org/10.1109/FPL.2013.6645524},
	year = {2013}
}
23rd International Conference on Field programmable Logic and Applications, FPL 2013, Porto, Portugal, September 2013
@inproceedings{abc,
	author = {Zsolt Istv{\'a}n and Gustavo Alonso and Michaela Blott and Kees A. Vissers},
	booktitle = {23rd International Conference on Field programmable Logic and Applications, FPL 2013, Porto, Portugal},
	title = {A flexible hash table design for 10GBPS key-value stores on FPGAS.},
	url = {http://dx.doi.org/10.1109/FPL.2013.6645520},
	year = {2013}
}
23rd International Conference on Field programmable Logic and Applications, FPL 2013, Porto, Portugal, September 2013
@inproceedings{abc,
	author = {Louis Woods and Zsolt Istv{\'a}n and Gustavo Alonso},
	booktitle = {23rd International Conference on Field programmable Logic and Applications, FPL 2013, Porto, Portugal},
	title = {Hybrid FPGA-accelerated SQL query processing.},
	url = {http://dx.doi.org/10.1109/FPL.2013.6645619},
	year = {2013}
}
Performance Characterization and Benchmarking - 5th TPC Technology Conference, TPCTC 2013, Trento, Italy, Revised Selected Papers, August 2013
@inproceedings{abc,
	author = {Martin Kaufmann and Peter M. Fischer and Norman May and Andreas Tonder and Donald Kossmann},
	booktitle = {Performance Characterization and Benchmarking - 5th TPC Technology Conference, TPCTC 2013, Trento, Italy},
	title = {TPC-BiH: A Benchmark for Bitemporal Databases.},
	url = {http://dx.doi.org/10.1007/978-3-319-04936-6_2},
	venue = {Revised Selected Papers},
	year = {2013}
}
Proceedings of the First VLDB Workshop on Databases and Crowdsourcing, DBCrowd 2013, Riva del Garda, Trento, Italy, August 2013
@inproceedings{abc,
	author = {Anja Gruenheid and Donald Kossmann},
	booktitle = {Proceedings of the First VLDB Workshop on Databases and Crowdsourcing, DBCrowd 2013, Riva del Garda, Trento, Italy},
	title = {Cost and Quality Trade-Offs in Crowdsourcing.},
	url = {http://ceur-ws.org/Vol-1025/vision2.pdf},
	year = {2013}
}
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, Sofia, Bulgaria, Volume 2: Short Papers, August 2013
@inproceedings{abc,
	author = {Vidhya Govindaraju and Ce Zhang and Christopher R{\'e}},
	booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013},
	title = {Understanding Tables in Context Using Standard NLP Toolkits.},
	url = {http://aclweb.org/anthology/P/P13/P13-2116.pdf},
	venue = {Sofia, Bulgaria, Volume 2: Short Papers},
	year = {2013}
}
Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 2013
@inproceedings{abc,
	author = {Arvind Arasu and Spyros Blanas and Ken Eguro and Manas Joglekar and Raghav Kaushik and Donald Kossmann and Ravishankar Ramamurthy and Prasang Upadhyaya and Ramarathnam Venkatesan},
	booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA},
	title = {Secure database-as-a-service with Cipherbase.},
	url = {http://doi.acm.org/10.1145/2463676.2467797},
	year = {2013}
}
The 40th Annual International Symposium on Computer Architecture, ISCA'13, Tel-Aviv, Israel, June 2013
@inproceedings{abc,
	author = {Adwait Jog and Onur Kayiran and Asit K. Mishra and Mahmut T. Kandemir and Onur Mutlu and Ravishankar Iyer and Chita R. Das},
	booktitle = {The 40th Annual International Symposium on Computer Architecture, ISCA{\textquoteright}13, Tel-Aviv, Israel},
	title = {Orchestrated scheduling and prefetching for GPGPUs.},
	url = {http://doi.acm.org/10.1145/2485922.2485951},
	year = {2013}
}
Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 2013
@inproceedings{abc,
	author = {Ce Zhang and Christopher R{\'e}},
	booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA},
	title = {Towards high-throughput gibbs sampling at scale: a study across storage managers.},
	url = {http://doi.acm.org/10.1145/2463676.2463702},
	year = {2013}
}
The 40th Annual International Symposium on Computer Architecture, ISCA'13, Tel-Aviv, Israel, June 2013
@inproceedings{abc,
	author = {Jamie Liu and Ben Jaiyen and Yoongu Kim and Chris Wilkerson and Onur Mutlu},
	booktitle = {The 40th Annual International Symposium on Computer Architecture, ISCA{\textquoteright}13, Tel-Aviv, Israel},
	title = {An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms.},
	url = {http://doi.acm.org/10.1145/2485922.2485928},
	year = {2013}
}
Systems Group Master's Thesis, no. 84; Department of Computer Science, June 2013
Supervised by: Prof. Timothy Roscoe
@mastersthesis{abc,
	author = {Zaheer Chothia},
	school = {84},
	title = {Investigating OS/DB co-design with SharedDB and Barrelfish},
	year = {2013}
}
Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 2013
@inproceedings{abc,
	author = {Ce Zhang and Vidhya Govindaraju and Jackson Borchardt and Tim Foltz and Christopher R{\'e} and Shanan Peters},
	booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA},
	title = {GeoDeepDive: statistical inference using familiar data-processing languages.},
	url = {http://doi.acm.org/10.1145/2463676.2463680},
	year = {2013}
}
The 40th Annual International Symposium on Computer Architecture, ISCA'13, Tel-Aviv, Israel, June 2013
@inproceedings{abc,
	author = {Jos{\'e} A. Joao and M. Aater Suleman and Onur Mutlu and Yale N. Patt},
	booktitle = {The 40th Annual International Symposium on Computer Architecture, ISCA{\textquoteright}13, Tel-Aviv, Israel},
	title = {Utility-based acceleration of multithreaded applications on asymmetric CMPs.},
	url = {http://doi.acm.org/10.1145/2485922.2485936},
	year = {2013}
}
Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 2013
@inproceedings{abc,
	author = {Martin Kaufmann and Amin Amiri Manjili and Panagiotis Vagenas and Peter M. Fischer and Donald Kossmann and Franz F{\"a}rber and Norman May},
	booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA},
	title = {Timeline index: a unified data structure for processing queries on temporal data in SAP HANA.},
	url = {http://doi.acm.org/10.1145/2463676.2465293},
	year = {2013}
}
The 7th ACM International Conference on Distributed Event-Based Systems, DEBS '13, Arlington, TX, June 2013
@inproceedings{abc,
	author = {Cagri Balkesen and Nesime Tatbul and M. Tamer {\"O}zsu},
	booktitle = {The 7th ACM International Conference on Distributed Event-Based Systems, DEBS {\textquoteright}13, Arlington, TX},
	title = {Adaptive input admission and management for parallel stream processing.},
	url = {http://doi.acm.org/10.1145/2488222.2488258},
	year = {2013}
}
Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 2013
@inproceedings{abc,
	author = {Georgios Giannikis and Darko Makreshanski and Gustavo Alonso and Donald Kossmann},
	booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA},
	title = {Workload optimization using SharedDB.},
	url = {http://doi.acm.org/10.1145/2463676.2463678},
	year = {2013}
}
The 7th ACM International Conference on Distributed Event-Based Systems, DEBS '13, Arlington, TX, June 2013
@inproceedings{abc,
	author = {Cagri Balkesen and Nihal Dindar and Matthias Wetter and Nesime Tatbul},
	booktitle = {The 7th ACM International Conference on Distributed Event-Based Systems, DEBS {\textquoteright}13, Arlington, TX},
	title = {RIP: run-based intra-query parallelism for scalable complex event processing.},
	url = {http://doi.acm.org/10.1145/2488222.2488257},
	year = {2013}
}
Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 2013
@inproceedings{abc,
	author = {Louis Woods and Jens Teubner and Gustavo Alonso},
	booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA},
	title = {Less watts, more performance: an intelligent storage engine for data appliances.},
	url = {http://doi.acm.org/10.1145/2463676.2463685},
	year = {2013}
}
The 7th ACM International Conference on Distributed Event-Based Systems, DEBS '13, Arlington, TX, June 2013
@inproceedings{abc,
	author = {Boris Glavic and Kyumars Sheykh Esmaili and Peter M. Fischer and Nesime Tatbul},
	booktitle = {The 7th ACM International Conference on Distributed Event-Based Systems, DEBS {\textquoteright}13, Arlington, TX},
	title = {Ariadne: managing fine-grained provenance on data streams.},
	url = {http://doi.acm.org/10.1145/2488222.2488256},
	year = {2013}
}
Networked Systems - First International Conference, NETYS 2013, Marrakech, Morocco, Revised Selected Papers, May 2013
@inproceedings{abc,
	author = {Ercan Ucan and Timothy Roscoe},
	booktitle = {Networked Systems - First International Conference, NETYS 2013, Marrakech, Morocco},
	title = {Establishing Efficient Routes between Personal Clouds.},
	url = {http://dx.doi.org/10.1007/978-3-642-40148-0_6},
	venue = {Revised Selected Papers},
	year = {2013}
}
Systems Group Master's Thesis, no. 83; Department of Computer Science, May 2013
Supervised by: Prof. Timothy Roscoe
@mastersthesis{abc,
	author = {Lukas Humbel},
	school = {83},
	title = {Multicore Virtualization over a Multikernel},
	year = {2013}
}
The 50th Annual Design Automation Conference 2013, DAC '13, Austin, TX, USA, May 2013
@inproceedings{abc,
	author = {Asit K. Mishra and Onur Mutlu and Chita R. Das},
	booktitle = {The 50th Annual Design Automation Conference 2013, DAC {\textquoteright}13, Austin, TX, USA},
	title = {A heterogeneous multiple network-on-chip design: an application-aware approach.},
	url = {http://doi.acm.org/10.1145/2463209.2488779},
	year = {2013}
}
14th Workshop on Hot Topics in Operating Systems, HotOS XIV, Santa Ana Pueblo, New Mexico, USA, May 2013
@inproceedings{abc,
	author = {Pravin Shinde and Antoine Kaufmann and Timothy Roscoe and Stefan Kaestle},
	booktitle = {14th Workshop on Hot Topics in Operating Systems, HotOS XIV, Santa Ana Pueblo, New Mexico, USA},
	title = {We Need to Talk About NICs.},
	url = {https://www.usenix.org/conference/hotos13/session/shinde},
	year = {2013}
}
Systems Group Master's Thesis, no. 78; Department of Computer Science, April 2013
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Georgios Gasparis},
	school = {78},
	title = {AIM: A System for Handling Enormous Workloads under Strict Latency and Scalability Regulations},
	year = {2013}
}
Eighth Eurosys Conference 2013, EuroSys '13, Prague, Czech Republic, April 2013
@inproceedings{abc,
	author = {Tudor-Ioan Salomie and Gustavo Alonso and Timothy Roscoe and Kevin Elphinstone},
	booktitle = {Eighth Eurosys Conference 2013, EuroSys {\textquoteright}13, Prague, Czech Republic},
	title = {Application level ballooning for efficient server consolidation.},
	url = {http://doi.acm.org/10.1145/2465351.2465384},
	year = {2013}
}
29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 2013
@inproceedings{abc,
	author = {Martin Kaufmann and Peter M. Fischer and Donald Kossmann and Norman May},
	booktitle = {29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia},
	title = {A generic database benchmarking service.},
	url = {http://doi.ieeecomputersociety.org/10.1109/ICDE.2013.6544923},
	year = {2013}
}
Systems Group Master's Thesis, no. 80; Department of Computer Science, April 2013
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Danilo Medic},
	school = {80},
	title = {Model Driven Optical Form Recognition},
	year = {2013}
}
Eighth Eurosys Conference 2013, EuroSys '13, Prague, Czech Republic, April 2013
@inproceedings{abc,
	author = {Gernot Heiser and Etienne Le Sueur and Adrian Danis and Aleksander Budzynowski and Tudor-Ioan Salomie and Gustavo Alonso},
	booktitle = {Eighth Eurosys Conference 2013, EuroSys {\textquoteright}13, Prague, Czech Republic},
	title = {RapiLog: reducing system complexity through verification.},
	url = {http://doi.acm.org/10.1145/2465351.2465383},
	year = {2013}
}
2012 IEEE International Symposium on Performance Analysis of Systems & Software, Austin, TX, USA, April 2013
@inproceedings{abc,
	author = {Emre Kultursay and Mahmut T. Kandemir and Anand Sivasubramaniam and Onur Mutlu},
	booktitle = {2012 IEEE International Symposium on Performance Analysis of Systems \& Software, Austin, TX, USA},
	title = {Evaluating STT-RAM as an energy-efficient main memory alternative.},
	url = {http://dx.doi.org/10.1109/ISPASS.2013.6557176},
	year = {2013}
}
21st IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2013, Seattle, WA, USA, April 2013
@inproceedings{abc,
	author = {Louis Woods and Gustavo Alonso and Jens Teubner},
	booktitle = {21st IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2013, Seattle, WA, USA},
	title = {Parallel Computation of Skyline Queries.},
	url = {http://doi.ieeecomputersociety.org/10.1109/FCCM.2013.18},
	year = {2013}
}
Systems Group Master's Thesis, no. 81; Department of Computer Science, April 2013
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Filip Curcic},
	school = {81},
	title = {Timeline Index on RamCloud},
	year = {2013}
}
2012 IEEE International Symposium on Performance Analysis of Systems & Software, Austin, TX, USA, April 2013
@inproceedings{abc,
	author = {Chuanjun Zhang and Glenn G. Ko and Jungwook Choi and Shang-nien Tsai and Minje Kim and Abner Guzm{\'a}n-Rivera and Rob A. Rutenbar and Paris Smaragdis and Mi Sun Park and Narayanan Vijaykrishnan and Hongyi Xin and Onur Mutlu and Bin Li and Li Zhao and Mei Chen},
	booktitle = {2012 IEEE International Symposium on Performance Analysis of Systems \& Software, Austin, TX, USA},
	title = {EMERALD: Characterization of emerging applications and algorithms for low-power devices.},
	url = {http://dx.doi.org/10.1109/ISPASS.2013.6557154},
	year = {2013}
}
29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 2013
@inproceedings{abc,
	author = {Gustavo Alonso},
	booktitle = {29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia},
	title = {Hardware killed the software star.},
	url = {http://doi.ieeecomputersociety.org/10.1109/ICDE.2013.6544807},
	year = {2013}
}
29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 2013
@inproceedings{abc,
	author = {Cagri Balkesen and Jens Teubner and Gustavo Alonso and M. Tamer {\"O}zsu},
	booktitle = {29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia},
	title = {Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware.},
	url = {http://doi.ieeecomputersociety.org/10.1109/ICDE.2013.6544839},
	year = {2013}
}
Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2013, Lombard, IL, USA, April 2013
@inproceedings{abc,
	author = {Junda Liu and Aurojit Panda and Ankit Singla and Brighten Godfrey and Michael Schapira and Scott Shenker},
	booktitle = {Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2013, Lombard, IL, USA},
	title = {Ensuring Connectivity via Data Plane Mechanisms.},
	url = {https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/liu_junda},
	year = {2013}
}
Systems Group Master's Thesis, no. 74; Department of Computer Science, April 2013
Supervised by: Prof. Gustavo Alonso
@mastersthesis{abc,
	author = {David Sidler},
	school = {74},
	title = {Column Storage for FPGA-accelerated Data Analytics},
	year = {2013}
}
29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 2013
@inproceedings{abc,
	author = {Martin Kaufmann and Amin Amiri Manjili and Stefan Hildenbrand and Donald Kossmann and Andreas Tonder},
	booktitle = {29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia},
	title = {Time travel in column stores.},
	url = {http://doi.ieeecomputersociety.org/10.1109/ICDE.2013.6544818},
	year = {2013}
}
Systems Group Master's Thesis, no. 75; Department of Computer Science, March 2013
Supervised by: Prof. Gustavo Alonso
@mastersthesis{abc,
	author = {Ilias Rinis},
	school = {75},
	title = {Exploring scalable transactional data processing in a cluster of multicores},
	year = {2013}
}
Systems Group Master's Thesis, no. 82; Department of Computer Science, March 2013
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Karolina Alexiou},
	school = {82},
	title = {Adaptive Range Filters for Query Optimization},
	year = {2013}
}
Systems Group Master's Thesis, no. 71; Department of Computer Science, March 2013
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Andreas Tschofen},
	school = {71},
	title = {Joint Inference of Concepts and Networks of Documents},
	year = {2013}
}
Design, Automation and Test in Europe, DATE 13, Grenoble, France, March 2013
@inproceedings{abc,
	author = {Yu Cai and Erich F. Haratsch and Onur Mutlu and Ken Mai},
	booktitle = {Design, Automation and Test in Europe, DATE 13, Grenoble, France},
	title = {Threshold voltage distribution in MLC NAND flash memory: characterization, analysis, and modeling.},
	year = {2013}
}
Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, Houston, TX, March 2013
@inproceedings{abc,
	author = {Adwait Jog and Onur Kayiran and Nachiappan Chidambaram Nachiappan and Asit K. Mishra and Mahmut T. Kandemir and Onur Mutlu and Ravishankar Iyer and Chita R. Das},
	booktitle = {Architectural Support for Programming Languages and Operating Systems, ASPLOS {\textquoteright}13, Houston, TX},
	title = {OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance.},
	url = {http://doi.acm.org/10.1145/2451116.2451158},
	year = {2013}
}
Systems Group Master's Thesis, no. 85; Department of Computer Science, March 2013
@mastersthesis{abc,
	author = {Zsolt Istv{\'a}n},
	school = {85},
	title = {Hash Table for Large Key-Value Stores on FPGAs},
	year = {2013}
}
Joint 2013 EDBT/ICDT Conferences, EDBT '13 Proceedings, Genoa, Italy, March 2013
@inproceedings{abc,
	author = {Christian Tinnefeld and Donald Kossmann and Martin Grund and Joos-Hendrik B{\"o}se and Frank Renkes and Vishal Sikka and Hasso Plattner},
	booktitle = {Joint 2013 EDBT/ICDT Conferences, EDBT {\textquoteright}13 Proceedings, Genoa, Italy},
	title = {Elastic online analytical processing on RAMCloud.},
	url = {http://doi.acm.org/10.1145/2452376.2452429},
	year = {2013}
}
Systems Group Master's Thesis, no. 70; Department of Computer Science, March 2013
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Julien Ribon},
	school = {70},
	title = {Big Data Query Parallelization},
	year = {2013}
}
19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China, February 2013
@inproceedings{abc,
	author = {Lavanya Subramanian and Vivek Seshadri and Yoongu Kim and Ben Jaiyen and Onur Mutlu},
	booktitle = {19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China},
	title = {MISE: Providing performance predictability and improving fairness in shared main memory systems.},
	url = {http://dx.doi.org/10.1109/HPCA.2013.6522356},
	year = {2013}
}
19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China, February 2013
@inproceedings{abc,
	author = {Reetuparna Das and Rachata Ausavarungnirun and Onur Mutlu and Akhilesh Kumar and Mani Azimi},
	booktitle = {19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China},
	title = {Application-to-core mapping policies to reduce memory system interference in multi-core systems.},
	url = {http://dx.doi.org/10.1109/HPCA.2013.6522311},
	year = {2013}
}
19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China, February 2013
@inproceedings{abc,
	author = {Donghyuk Lee and Yoongu Kim and Vivek Seshadri and Jamie Liu and Lavanya Subramanian and Onur Mutlu},
	booktitle = {19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China},
	title = {Tiered-latency DRAM: A low latency and low cost DRAM architecture.},
	url = {http://dx.doi.org/10.1109/HPCA.2013.6522354},
	year = {2013}
}
PVLDB, January 2013
@inproceedings{abc,
	author = {Karolina Alexiou and Donald Kossmann and Per-Ake Larson},
	booktitle = {PVLDB},
	title = {Adaptive Range Filters for Cold Data: Avoiding Trips to Siberia.},
	url = {http://www.vldb.org/pvldb/vol6/p1714-kossmann.pdf},
	year = {2013}
}
Bulletin of the EATCS, January 2013
@article{abc,
	author = {Anja Gruenheid and Donald Kossmann and Besmira Nushi and Yuri Gurevich},
	journal = {Bulletin of the EATCS},
	title = {When is A=B?},
	url = {http://eatcs.org/beatcs/index.php/beatcs/article/view/206},
	year = {2013}
}
BMC Genomics, January 2013
@article{abc,
	author = {Hongyi Xin and Donghyuk Lee and Farhad Hormozdiari and Samihan Yedkar and Onur Mutlu and Can Alkan},
	journal = {BMC Genomics},
	title = {Accelerating read mapping with FastHASH.},
	url = {http://dx.doi.org/10.1186/1471-2164-14-S1-S13},
	year = {2013}
}
CoRR, January 2013
Many problems in machine learning can be solved by rounding the solution of an appropriate linear program (LP). This paper shows that we can recover solutions of comparable quality by rounding an approximate LP solution instead of the exact one. These approximate LP solutions can be computed efficiently by applying a parallel stochastic-coordinate-descent method to a quadratic-penalty formulation of the LP. We derive worst-case runtime and solution quality guarantees of this scheme using novel perturbation and convergence analysis. Our experiments demonstrate that on such combinatorial problems as vertex cover, independent set and multiway-cut, our approximate rounding scheme is up to an order of magnitude faster than Cplex (a commercial LP solver) while producing solutions of similar quality.
@inproceedings{abc,
	abstract = {Many problems in machine learning can be solved by rounding the solution of an appropriate linear program (LP). This paper shows that we can recover solutions of comparable quality by rounding an approximate LP solution instead of the exact one. These approximate LP solutions can be computed efficiently by applying a parallel stochastic-coordinate-descent method to a quadratic-penalty formulation of the LP. We derive worst-case runtime and solution quality guarantees of this scheme using novel perturbation and convergence analysis. Our experiments demonstrate that on such combinatorial problems as vertex cover, independent set and multiway-cut, our approximate rounding scheme is up to an order of magnitude faster than Cplex (a commercial LP solver) while producing solutions of similar quality.},
	author = {Srikrishna Sridhar and Victor Bittorf and Ji Liu and Ce Zhang and Christopher R{\'e} and Stephen J. Wright},
	booktitle = {CoRR},
	title = {An Approximate, Efficient Solver for LP Rounding.},
	url = {http://arxiv.org/abs/1311.2661},
	year = {2013}
}
January 2013
@techreport{abc,
	author = {Tahmineh Sanamrad and Lucas Braun and Andreas Marfurt and Donald Kossmann and Ramarathnam Venkatesan},
	title = {POP: A new Encryption Scheme for Dynamic Databases},
	year = {2013}
}
CoRR, January 2013
@inproceedings{abc,
	author = {Mario Paolucci and Donald Kossmann and Rosaria Conte and Paul Lukowicz and Panos Argyrakis and Ann Blandford and Giulia Bonelli and Stuart Anderson and Sara de Freitas and Bruce Edmonds and G. Nigel Gilbert and Markus H. Gross and J{\"o}rn Kohlhammer and Petros Koumoutsakos and Andreas Krause and Bj{\"o}rn-Ola Linn{\'e}r and Philipp Slusallek and Olga Sorkine-Hornung and Robert W. Sumner and Dirk Helbing},
	booktitle = {CoRR},
	title = {Towards a living earth simulator},
	url = {http://arxiv.org/abs/1304.1903},
	year = {2013}
}
In Search of Elegance in the Theory and Practice of Computation - Essays Dedicated to Peter Buneman, January 2013
@inproceedings{abc,
	author = {Boris Glavic and Ren{\'e}e J. Miller and Gustavo Alonso},
	booktitle = {In Search of Elegance in the Theory and Practice of Computation - Essays Dedicated to Peter Buneman},
	title = {Using SQL for Efficient Generation and Querying of Provenance Information.},
	url = {http://dx.doi.org/10.1007/978-3-642-41660-6_16},
	year = {2013}
}
January 2013
An increasing number of applications such as risk evaluation in banking or inventory management require support for temporal data. After more than a decade of standstill, the recent adoption of some bitemporal features in SQL:2011 has reinvigorated the support among commercial database vendors, who incorporate an increasing number of relevant bitemporal features. Naturally, assessing the performance and scalability of temporal data storage and operations is of great concern for potential users. The cost of keeping and querying history with novel operations (such as time travel, temporal joins or temporal aggregations) is not adequately reflected in any existing benchmark. In this paper, we present a benchmark proposal which provides comprehensive coverage of the bitemporal data management. It builds on the solid foundations of TPC-H but extends it with a rich set of queries and update scenarios. This workload stems both from real-life temporal applications from SAP's customer base and a systematic coverage of temporal operators proposed in the academic literature. In the accompanying paper we present preliminary results of our benchmark on a number of temporal database systems, also highlighting the need for certain language extensions. In the appendix of this technical report we provide all details required to implement the benchmark.
@techreport{abc,
	abstract = {An increasing number of applications such as risk evaluation in banking or inventory management require support for temporal data.
After more than a decade of standstill, the recent adoption of some bitemporal features in SQL:2011 has reinvigorated the support among commercial database vendors, who incorporate an increasing number of relevant bitemporal features. Naturally, assessing the performance and scalability of temporal data storage and operations is of great concern for potential users.
The cost of keeping and querying history with novel operations (such as time travel, temporal joins or temporal aggregations) is not adequately reflected in any existing benchmark.
In this paper, we present a benchmark proposal which provides comprehensive coverage of the bitemporal data management.
It builds on the solid foundations of TPC-H but extends it with a rich set of queries and update scenarios.
This workload stems both from real-life temporal applications from SAP{\textquoteright}s customer base and a systematic coverage of temporal operators proposed in the academic literature.
In the accompanying paper we present preliminary results of our benchmark on a number of temporal database systems, also highlighting the need for certain language extensions.
In the appendix of this technical report we provide all details required to implement the benchmark.},
	author = {Martin Kaufmann and Peter M. Fischer and Norman May and Donald Kossmann},
	title = {Benchmarking Databases with History Support},
	url = {http://dx.doi.org/10.3929/ethz-a-009994978},
	year = {2013}
}
CoRR, January 2013
@article{abc,
	author = {Fosca Giannotti and Dino Pedreschi and Alex Pentland and Paul Lukowicz and Donald Kossmann and James L. Crowley and Dirk Helbing},
	journal = {CoRR},
	title = {A planetary nervous system for social mining and collective awareness},
	url = {http://arxiv.org/abs/1304.3700},
	year = {2013}
}
PVLDB, January 2013
@article{abc,
	author = {Cagri Balkesen and Gustavo Alonso and Jens Teubner and M. Tamer {\"O}zsu},
	journal = {PVLDB},
	title = {Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited.},
	url = {http://www.vldb.org/pvldb/vol7/p85-balkesen.pdf},
	year = {2013}
}
Proceedings of HotCloud '13 (5th USENIX Workshop on Hot Topics in Cloud Computing), San Jose, CA, USA, January 2013
@inproceedings{abc,
	author = {Michaela Blott and Kimon Karras and Ling Liu and Kees Vissers and Jeremia B{\"a}r and Zsolt Istv{\'a}n},
	booktitle = {Proceedings of HotCloud {\textquoteright}13 (5th USENIX Workshop on Hot Topics in Cloud Computing)},
	title = {Achieving 10Gbps line-rate key-value stores with FPGAs},
	venue = {San Jose, CA, USA},
	year = {2013}
}
CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 2013
@inproceedings{abc,
	author = {Arvind Arasu and Spyros Blanas and Ken Eguro and Raghav Kaushik and Donald Kossmann and Ravishankar Ramamurthy and Ramarathnam Venkatesan},
	booktitle = {CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA},
	title = {Orthogonal Security with Cipherbase.},
	url = {http://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper33.pdf},
	year = {2013}
}
PVLDB, January 2013
@inproceedings{abc,
	author = {Martin Kaufmann},
	booktitle = {PVLDB},
	title = {Storing and Processing Temporal Data in a Main Memory Column Store.},
	url = {http://www.vldb.org/pvldb/vol6/p1444-kaufmann.pdf},
	year = {2013}
}
Proceedings of the Workshop on Secure Data Management Workshop, in conjunction with VLDB, Riva del Garda, Italy, January 2013
@inproceedings{abc,
	author = {Tahmineh Sanamrad and Donald Kossmann},
	booktitle = {Proceedings of the Workshop on Secure Data Management Workshop, in conjunction with VLDB, Riva del Garda, Italy},
	title = {Query Log Attack on Encrypted Databases},
	year = {2013}
}
CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 2013
@inproceedings{abc,
	author = {Jana Giceva and Tudor-Ioan Salomie and Adrian Sch{\"u}pbach and Gustavo Alonso and Timothy Roscoe},
	booktitle = {CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA},
	title = {COD: Database / Operating System Co-Design.},
	url = {http://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper71.pdf},
	year = {2013}
}
PVLDB, January 2013
@inproceedings{abc,
	author = {Martin Kaufmann and Panagiotis Vagenas and Peter M. Fischer and Donald Kossmann and Franz F{\"a}rber},
	booktitle = {PVLDB},
	title = {Comprehensive and Interactive Temporal Query Processing with SAP HANA.},
	url = {http://www.vldb.org/pvldb/vol6/p1210-kaufmann.pdf},
	year = {2013}
}
ACM Transactions on Database Systems (TODS), vol. 38(4), January 2013
@article{abc,
	author = {Chongling Nie and Louis Woods and Jens Teubner},
	journal = {ACM Transactions on Database Systems (TODS), vol. 38(4)},
	title = {XLynx---An FPGA-based XML Filter for Hybrid XQuery Processing.},
	year = {2013}
}
CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 2013
@inproceedings{abc,
	author = {Michael Anderson and Dolan Antenucci and Victor Bittorf and Matthew Burgess and Michael J. Cafarella and Arun Kumar and Feng Niu and Yongjoo Park and Christopher R{\'e} and Ce Zhang},
	booktitle = {CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA},
	title = {Brainwash: A Data System for Feature Engineering.},
	url = {http://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper82.pdf},
	year = {2013}
}
CoRR, January 2013
@article{abc,
	author = {Ankit Singla and Brighten Godfrey and Alexandra Kolla},
	journal = {CoRR},
	title = {High Throughput Data Center Topology Design.},
	url = {http://arxiv.org/abs/1309.7066},
	year = {2013}
}
CoRR, January 2013
@article{abc,
	author = {Ankit Singla and Brighten Godfrey and Kevin R. Fall and Gianluca Iannaccone and Sylvia Ratnasamy},
	journal = {CoRR},
	title = {Scalable Routing on Flat Names},
	url = {http://arxiv.org/abs/1302.6156},
	year = {2013}
}
VLDB J., January 2013
@inproceedings{abc,
	author = {Nihal Dindar and Nesime Tatbul and Ren{\'e}e J. Miller and Laura M. Haas and Irina Botan},
	booktitle = {VLDB J.},
	title = {Modeling the execution semantics of stream processing engines with SECRET.},
	url = {http://dx.doi.org/10.1007/s00778-012-0297-3},
	year = {2013}
}
Datenbanksysteme für Business, Technologie und Web (BTW), - Workshopband, 15. Fachtagung des GI-Fachbereichs "Datenbanken und Informationssysteme" (DBIS), 11.-15.3.2013 in Magdeburg, Germany. Proceedings, January 2013
@inproceedings{abc,
	author = {Cagri Balkesen and Louis Woods and Jens Teubner},
	booktitle = {Datenbanksysteme f{\"u}r Business, Technologie und Web (BTW), - Workshopband, 15. Fachtagung des GI-Fachbereichs "Datenbanken und Informationssysteme" (DBIS), 11.-15.3.2013 in Magdeburg, Germany. Proceedings},
	title = {Tutorium: Neue Hardwarearchitekturen f{\"u}r das Datenmanagement (DPMH).},
	year = {2013}
}

2012

ETH Zürich, Diss. Nr. 20931, December 2012
Supervised by: Prof. Nesime Tatbul
@phdthesis{abc,
	author = {Alexandru Moga},
	school = {20931},
	title = {Load management for streaming analytics},
	year = {2012}
}
12th IEEE International Conference on Data Mining, ICDM 2012, Brussels, Belgium, December 2012
@inproceedings{abc,
	author = {Feng Niu and Ce Zhang and Christopher R{\'e} and Jude W. Shavlik},
	booktitle = {12th IEEE International Conference on Data Mining, ICDM 2012, Brussels, Belgium},
	title = {Scaling Inference for Markov Logic via Dual Decomposition.},
	url = {http://dx.doi.org/10.1109/ICDM.2012.96},
	year = {2012}
}
Middleware 2012 - ACM/IFIP/USENIX 13th International Middleware Conference, Montreal, QC, Canada, December 2012
@inproceedings{abc,
	author = {Ioana Giurgiu and Claris Castillo and Asser N. Tantawi and Malgorzata Steinder},
	booktitle = {Middleware 2012 - ACM/IFIP/USENIX 13th International Middleware Conference, Montreal, QC, Canada},
	title = {Enabling Efficient Placement of Virtual Infrastructures in the Cloud.},
	url = {http://dx.doi.org/10.1007/978-3-642-35170-9_17},
	year = {2012}
}
Middleware 2012 - ACM/IFIP/USENIX 13th International Middleware Conference, Montreal, QC, Canada, December 2012
@inproceedings{abc,
	author = {Ioana Giurgiu and Oriana Riva and Gustavo Alonso},
	booktitle = {Middleware 2012 - ACM/IFIP/USENIX 13th International Middleware Conference, Montreal, QC, Canada},
	title = {Dynamic Software Deployment from Clouds to Mobile Devices.},
	url = {http://dx.doi.org/10.1007/978-3-642-35170-9_20},
	year = {2012}
}
Systems Group Master's Thesis, no. 64; Department of Computer Science, December 2012
Supervised by: Prof. Nesime Tatbul
@mastersthesis{abc,
	author = {Matthias Wetter},
	school = {64},
	title = {Parallel Processing of Pattern Matching Queries},
	year = {2012}
}
ETH Zürich, Diss. Nr. 20930, December 2012
Supervised by: Prof. Timothy Roscoe
@phdthesis{abc,
	author = {Adrian Sch{\"u}pbach},
	school = {20930},
	title = {Tackling OS Complexity with declarative techniques},
	year = {2012}
}
ETH Zürich, Diss. Nr. 20932, December 2012
Supervised by: Prof. Timothy Roscoe
@phdthesis{abc,
	author = {Qin Yin},
	school = {20932},
	title = {Declarative Resource Management for Virtual Network Systems},
	year = {2012}
}
ETH Zürich, Diss. Nr. 20949, December 2012
Supervised by: Prof. Nesime Tatbul
@phdthesis{abc,
	author = {Nihal Dindar},
	school = {20949},
	title = {Modeling Window Execution Semantics of Stream Processing Engines},
	year = {2012}
}
Systems Group Master's Thesis, no. 59; Department of Computer Science, November 2012
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Erfan Zamanian},
	school = {59},
	title = {Query Optimization in CrowdDB},
	year = {2012}
}
Proceedings of The Twenty-First Text REtrieval Conference, TREC 2012, Gaithersburg, Maryland, USA, November 2012
@inproceedings{abc,
	author = {John R. Frank and Max Kleiman-Weiner and Daniel A. Roberts and Feng Niu and Ce Zhang and Christopher R{\'e} and Ian Soboroff},
	booktitle = {Proceedings of The Twenty-First Text REtrieval Conference, TREC 2012, Gaithersburg, Maryland, USA},
	title = {Building an Entity-Centric Stream Filtering Test Collection for TREC 2012.},
	url = {http://trec.nist.gov/pubs/trec21/papers/KBA.OVERVIEW.pdf},
	year = {2012}
}
Systems Group Master's Thesis, no. 76; Department of Computer Science, October 2012
Supervised by: Prof. Gustavo Alonso
@mastersthesis{abc,
	author = {Milos Andjelkovic},
	school = {76},
	title = {High Availability in Service Oriented Architectures},
	year = {2012}
}
Systems Group Master's Thesis, no. 34; Department of Computer Science, October 2012
Supervised by: Prof. Donald Kossmann
In most computer systems, log records are an important source for failure detection, monitoring, statistics and various kinds of analytics. Simple log file generation, as it is often implemented, is no longer sufficient for large computer systems generating hundreds of gigabytes of log data per day. Those systems need a more sophisticated log storage platform which is able to store hundreds of thousands of log entries per second and offers high-performance query capabilities. This thesis proposes a solution using open-source components like Apache HBase and Apache Solr and presents benchmarks of concrete implementations using real data from Amadeus, the world’s leading travel transaction operator.
@mastersthesis{abc,
	abstract = {In most computer systems, log records are an important source for failure detection, monitoring, statistics and various kinds of analytics. Simple log file generation, as it is often implemented, is no longer sufficient for large computer systems generating hundreds of gigabytes of log data per day. Those systems need a more sophisticated log storage platform which is able to store hundreds of thousands of log entries per second and offers high-performance query capabilities. This thesis proposes a solution using open-source components like Apache HBase and Apache Solr and presents benchmarks of concrete implementations using real data from Amadeus, the world{\textquoteright}s leading travel transaction operator.},
	author = {Martin Alig},
	school = {34},
	title = {Database Logging System},
	year = {2012}
}
Systems Group Master's Thesis, no. 61; Department of Computer Science, October 2012
Supervised by: Prof. Nesime Tatbul
@mastersthesis{abc,
	author = {Ren{\'e} Buffat},
	school = {61},
	title = {Application of Recency Window in StreamInsight for Service Intelligence},
	year = {2012}
}
Systems Group Master's Thesis, no. 58; Department of Computer Science, October 2012
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Adiya Abisheva},
	school = {58},
	title = {Crowdsourced Order: Getting Top N Values From the Crowd},
	year = {2012}
}
IEEE 24th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2012, New York, NY, USA, October 2012
@inproceedings{abc,
	author = {Kevin Kai-Wei Chang and Rachata Ausavarungnirun and Chris Fallin and Onur Mutlu},
	booktitle = {IEEE 24th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2012, New York, NY, USA},
	title = {HAT: Heterogeneous Adaptive Throttling for On-Chip Networks.},
	url = {http://doi.ieeecomputersociety.org/10.1109/SBAC-PAD.2012.44},
	year = {2012}
}
International Conference on Parallel Architectures and Compilation Techniques, PACT '12, Minneapolis, MN, September 2012
@inproceedings{abc,
	author = {Gennady Pekhimenko and Todd C. Mowry and Onur Mutlu},
	booktitle = {International Conference on Parallel Architectures and Compilation Techniques, PACT {\textquoteright}12, Minneapolis, MN},
	title = {Linearly compressed pages: a main memory compression framework with low complexity and low latency.},
	url = {http://doi.acm.org/10.1145/2370816.2370911},
	year = {2012}
}
ETH Zürich, Diss. Nr. 20692, September 2012
Supervised by: Prof. Gustavo Alonso
@phdthesis{abc,
	author = {Ioana Giurgiu},
	school = {20692},
	title = {Integrating cloud applications with mobile devices},
	year = {2012}
}
International Conference on Parallel Architectures and Compilation Techniques, PACT '12, Minneapolis, MN, September 2012
@inproceedings{abc,
	author = {Vivek Seshadri and Onur Mutlu and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {International Conference on Parallel Architectures and Compilation Techniques, PACT {\textquoteright}12, Minneapolis, MN},
	title = {The evicted-address filter: a unified mechanism to address both cache pollution and thrashing.},
	url = {http://doi.acm.org/10.1145/2370816.2370868},
	year = {2012}
}
ETH Zürich, Diss. Nr. 20664, September 2012
Supervised by: Prof. Timothy Roscoe
@phdthesis{abc,
	author = {Simon Peter},
	school = {20664},
	title = {Resource Management in a Multicore Operating System},
	year = {2012}
}
International Conference on Parallel Architectures and Compilation Techniques, PACT '12, Minneapolis, MN, September 2012
@inproceedings{abc,
	author = {Gennady Pekhimenko and Vivek Seshadri and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry},
	booktitle = {International Conference on Parallel Architectures and Compilation Techniques, PACT {\textquoteright}12, Minneapolis, MN},
	title = {Base-delta-immediate compression: practical data compression for on-chip caches.},
	url = {http://doi.acm.org/10.1145/2370816.2370870},
	year = {2012}
}
Systems Group Master's Thesis, no. 57; Department of Computer Science, September 2012
Supervised by: Prof. Gustavo Alonso
@mastersthesis{abc,
	author = {Chongling Nie},
	school = {57},
	title = {An FPGA-based Smart Database Storage Engine},
	year = {2012}
}
Systems Group Master's Thesis, no. 66; Department of Computer Science, September 2012
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Georg Polzer},
	school = {66},
	title = {Near-realtime Pattern Analysis on Big Data},
	year = {2012}
}
30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada, September 2012
@inproceedings{abc,
	author = {Yu Cai and Gulay Yalcin and Onur Mutlu and Erich F. Haratsch and Adri{\'a}n Cristal and Osman S. Unsal and Ken Mai},
	booktitle = {30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada},
	title = {Flash correct-and-refresh: Retention-aware error management for increased flash memory lifetime.},
	url = {http://doi.ieeecomputersociety.org/10.1109/ICCD.2012.6378623},
	year = {2012}
}
Information Computing and Applications - Third International Conference, ICICA 2012, Chengde, China, September 2012
@inproceedings{abc,
	author = {Ce Zhang and Gang Cui and Bin Jin and Liang Wang},
	booktitle = {Information Computing and Applications - Third International Conference, ICICA 2012, Chengde, China},
	title = {Study of Trustworthiness Measurement and Kernel Modules Accessing Address Space of Any Process.},
	url = {http://dx.doi.org/10.1007/978-3-642-34062-8_56},
	year = {2012}
}
Systems Group Master's Thesis, no. 56; Department of Computer Science, September 2012
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Thomas Etter},
	school = {56},
	title = {Distributed Snapshot Isolation on RamCloud},
	year = {2012}
}
30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada, September 2012
@inproceedings{abc,
	author = {Justin Meza and Jing Li and Onur Mutlu},
	booktitle = {30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada},
	title = {A case for small row buffers in non-volatile main memories.},
	url = {http://dx.doi.org/10.1109/ICCD.2012.6378685},
	year = {2012}
}
International Conference on Parallel Architectures and Compilation Techniques, PACT '12, Minneapolis, MN, September 2012
@inproceedings{abc,
	author = {Reetuparna Das and Rachata Ausavarungnirun and Onur Mutlu and Akhilesh Kumar and Mani Azimi},
	booktitle = {International Conference on Parallel Architectures and Compilation Techniques, PACT {\textquoteright}12, Minneapolis, MN},
	title = {Application-to-core mapping policies to reduce memory interference in multi-core systems.},
	url = {http://doi.acm.org/10.1145/2370816.2370893},
	year = {2012}
}
Systems Group Master's Thesis, no. 60; Department of Computer Science, September 2012
Supervised by: Prof. Gustavo Alonso
This thesis tackles the problem of allocating resources for networked applications in data center topologies, which is a known NP-hard decision problem. Various proposals have been introduced in the past which either make simplifying assumptions or solve a special instance of the problem. Our approach satisfies all demands on the application side as well as resource constraints of data centers, while remaining scalable. What is more, it is a generic solution that can be further tailored to specific workloads or cloud architectures. In this report we propose and evaluate two algorithms that perform the resource allocation decision and placement for various application topologies and workloads. The first one uses a greedy approach of placing virtual links sequentially and backtracking when a constraint is not met. The second one clusters the application into a network-optimized package, deduces a subgraph of the data center topology and makes the placement decision using a customizable heuristic. Depending on the decision, the placement takes place in the reduced subgraph. Both algorithm implementations have been evaluated using a variety of realistic workloads, different testing scenarios and data center topologies. The results show that while both algorithms perform well in all cases, depending on the testing conditions, one of the two yields better results. Finally, we propose a technique for selecting which of the two algorithms, or a combination of both, data center operators should use to allocate the resources of their infrastructure more efficiently.
@mastersthesis{abc,
	abstract = {This thesis tackles the problem of allocating resources for networked applications in data center topologies, which is a known NP-hard decision problem. Various proposals have been introduced in the past which either make simplifying assumptions or solve a special instance of the problem. Our approach satisfies all demands on the application side as well as resource constraints of data centers, while remaining scalable. What is more, it is a generic solution that can be further tailored to specific workloads or cloud architectures.
In this report we propose and evaluate two algorithms that perform the resource allocation decision and placement for various application topologies and workloads. The first one uses a greedy approach of placing virtual links sequentially and backtracking when a constraint is not met. The second one clusters the application into a network-optimized package, deduces a subgraph of the data center topology and makes the placement decision using a customizable heuristic. Depending on the decision, the placement takes place in the reduced subgraph. Both algorithm implementations have been evaluated using a variety of realistic workloads, different testing scenarios and data center topologies. The results show that while both algorithms perform well in all cases, depending on the testing conditions, one of the two yields better results. Finally, we propose a technique for selecting which of the two algorithms, or a combination of both, data center operators should use to allocate the resources of their infrastructure more efficiently.},
	author = {Spyridon Giannakakis},
	school = {60},
	title = {Design of Traffic-Aware Placement Techniques of Applications in Modern Data Centers},
	year = {2012}
}
30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada, September 2012
@inproceedings{abc,
	author = {HanBin Yoon and Justin Meza and Rachata Ausavarungnirun and Rachael Harding and Onur Mutlu},
	booktitle = {30th International IEEE Conference on Computer Design, ICCD 2012, Montreal, QC, Canada},
	title = {Row buffer locality aware caching policies for hybrid memories.},
	url = {http://dx.doi.org/10.1109/ICCD.2012.6378661},
	year = {2012}
}
International Conference on Parallel Architectures and Compilation Techniques, PACT '12, Minneapolis, MN, September 2012
@inproceedings{abc,
	author = {Nachiappan Chidambaram Nachiappan and Asit K. Mishra and Mahmut T. Kandemir and Anand Sivasubramaniam and Onur Mutlu and Chita R. Das},
	booktitle = {International Conference on Parallel Architectures and Compilation Techniques, PACT {\textquoteright}12, Minneapolis, MN},
	title = {Application-aware prefetch prioritization in on-chip networks.},
	url = {http://doi.acm.org/10.1145/2370816.2370886},
	year = {2012}
}
ETH Zürich, Diss. Nr. 20666, September 2012
Supervised by: Prof. Gustavo Alonso
@phdthesis{abc,
	author = {Ionut Emanuel Subasu},
	school = {20666},
	title = {Multicore architectures as platform to extend database engine functionality},
	year = {2012}
}
ACM SIGCOMM 2012 Conference, SIGCOMM '12, Helsinki, August 2012
@inproceedings{abc,
	author = {George Nychis and Chris Fallin and Thomas Moscibroda and Onur Mutlu and Srinivasan Seshan},
	booktitle = {ACM SIGCOMM 2012 Conference, SIGCOMM {\textquoteright}12, Helsinki},
	title = {On-chip networks from a networking perspective: congestion and scalability in many-core interconnects.},
	url = {http://doi.acm.org/10.1145/2342356.2342436},
	year = {2012}
}
Systems Group Master's Thesis, no. 53; Department of Computer Science, August 2012
Supervised by: Prof. Gustavo Alonso
@mastersthesis{abc,
	author = {Darko Makreshanski},
	school = {53},
	title = {Shared, Parallel Database Join on Modern Hardware},
	year = {2012}
}
The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, Beijing, China, August 2012
@inproceedings{abc,
	author = {Pratanu Roy and Jens Teubner and Gustavo Alonso},
	booktitle = {The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD {\textquoteright}12, Beijing, China},
	title = {Efficient frequent item counting in multi-core hardware.},
	url = {http://doi.acm.org/10.1145/2339530.2339757},
	year = {2012}
}
Proceedings of the Second International Workshop on Searching and Integrating New Web Data Sources, Istanbul, Turkey, August 2012
@inproceedings{abc,
	author = {Feng Niu and Ce Zhang and Christopher R{\'e} and Jude W. Shavlik},
	booktitle = {Proceedings of the Second International Workshop on Searching and Integrating New Web Data Sources, Istanbul, Turkey},
	title = {DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference.},
	url = {http://ceur-ws.org/Vol-884/VLDS2012_p25_Niu.pdf},
	year = {2012}
}
ACM Symposium on Principles of Distributed Computing, PODC '12, Funchal, Madeira, Portugal, July 2012
@inproceedings{abc,
	author = {Joan Feigenbaum and Brighten Godfrey and Aurojit Panda and Michael Schapira and Scott Shenker and Ankit Singla},
	booktitle = {ACM Symposium on Principles of Distributed Computing, PODC {\textquoteright}12, Funchal, Madeira, Portugal},
	title = {Brief announcement: on the resilience of routing tables.},
	url = {http://doi.acm.org/10.1145/2332432.2332478},
	year = {2012}
}
Asia-Pacific Workshop on Systems, APSys '12, Seoul, Republic of Korea, July 2012
@inproceedings{abc,
	author = {Qin Yin and Timothy Roscoe},
	booktitle = {Asia-Pacific Workshop on Systems, APSys {\textquoteright}12, Seoul, Republic of Korea},
	title = {Towards realistic benchmarks for virtual infrastructure resource allocators.},
	url = {http://doi.acm.org/10.1145/2349896.2349901},
	year = {2012}
}
Systems Group Master's Thesis, no. 44; Department of Computer Science, July 2012
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Patrick Nick},
	school = {44},
	title = {Encrypting Gmail},
	year = {2012}
}
The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Jeju Island, Korea - Volume 1: Long Papers, July 2012
@inproceedings{abc,
	author = {Ce Zhang and Feng Niu and Christopher R{\'e} and Jude W. Shavlik},
	booktitle = {The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference},
	title = {Big Data versus the Crowd: Looking for Relationships in All the Right Places.},
	url = {http://www.aclweb.org/anthology/P12-1087},
	venue = {Jeju Island, Korea - Volume 1: Long Papers},
	year = {2012}
}
Systems Group Master's Thesis, no. 43; Department of Computer Science, July 2012
Supervised by: Prof. Timothy Roscoe
@mastersthesis{abc,
	author = {Manuel Stocker},
	school = {43},
	title = {Towards a File-System Service for the Barrelfish OS},
	year = {2012}
}
Systems Group Master's Thesis, no. 62; Department of Computer Science, June 2012
Supervised by: Prof. Timothy Roscoe
@mastersthesis{abc,
	author = {David Gerhard},
	school = {62},
	title = {Using modularity to scale a Multikernel network stack},
	year = {2012}
}
Scientific and Statistical Database Management - 24th International Conference, SSDBM 2012, Chania, Crete, Greece, June 2012
@inproceedings{abc,
	author = {Romeo Kienzler and R{\'e}my Bruggmann and Anand Ranganathan and Nesime Tatbul},
	booktitle = {Scientific and Statistical Database Management - 24th International Conference, SSDBM 2012, Chania, Crete, Greece},
	title = {Incremental DNA Sequence Analysis in the Cloud.},
	url = {http://dx.doi.org/10.1007/978-3-642-31235-9_50},
	year = {2012}
}
39th International Symposium on Computer Architecture (ISCA 2012), Portland, OR, USA, June 2012
@inproceedings{abc,
	author = {Jamie Liu and Ben Jaiyen and Richard Veras and Onur Mutlu},
	booktitle = {39th International Symposium on Computer Architecture (ISCA 2012)},
	title = {RAIDR: Retention-aware intelligent DRAM refresh.},
	url = {http://dx.doi.org/10.1109/ISCA.2012.6237001},
	venue = {Portland, OR, USA},
	year = {2012}
}
39th International Symposium on Computer Architecture (ISCA 2012), Portland, OR, USA, June 2012
@inproceedings{abc,
	author = {Yoongu Kim and Vivek Seshadri and Donghyuk Lee and Jamie Liu and Onur Mutlu},
	booktitle = {39th International Symposium on Computer Architecture (ISCA 2012)},
	title = {A case for exploiting subarray-level parallelism (SALP) in DRAM.},
	url = {http://dx.doi.org/10.1109/ISCA.2012.6237032},
	venue = {Portland, OR, USA},
	year = {2012}
}
39th International Symposium on Computer Architecture (ISCA 2012), Portland, OR, USA, June 2012
@inproceedings{abc,
	author = {Rachata Ausavarungnirun and Kevin Kai-Wei Chang and Lavanya Subramanian and Gabriel H. Loh and Onur Mutlu},
	booktitle = {39th International Symposium on Computer Architecture (ISCA 2012)},
	title = {Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems.},
	url = {http://dx.doi.org/10.1109/ISCA.2012.6237036},
	venue = {Portland, OR, USA},
	year = {2012}
}
Systems Group Master's Thesis, no. 46; Department of Computer Science, May 2012
Supervised by: Prof. Timothy Roscoe
@mastersthesis{abc,
	author = {Mark Nevill},
	school = {46},
	title = {An Evaluation of Capabilities for a Multikernel},
	year = {2012}
}
Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 2012
@inproceedings{abc,
	author = {Jens Teubner and Louis Woods and Chongling Nie},
	booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA},
	title = {Skeleton automata for FPGAs: reconfiguring without reconstructing.},
	url = {http://doi.acm.org/10.1145/2213836.2213863},
	year = {2012}
}
2012 Sixth IEEE/ACM International Symposium on Networks-on-Chip (NoCS), Copenhagen, Denmark, May 2012
@inproceedings{abc,
	author = {Chris Fallin and Greg Nazario and Xiangyao Yu and Kevin Kai-Wei Chang and Rachata Ausavarungnirun and Onur Mutlu},
	booktitle = {2012 Sixth IEEE/ACM International Symposium on Networks-on-Chip (NoCS), Copenhagen, Denmark},
	title = {MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect.},
	url = {http://dx.doi.org/10.1109/NOCS.2012.8},
	year = {2012}
}
Systems Group Master's Thesis, no. 47; Department of Computer Science, May 2012
Supervised by: Prof. Timothy Roscoe
@mastersthesis{abc,
	author = {Simon Gerber},
	school = {47},
	title = {Virtual Memory in a Multikernel},
	year = {2012}
}
Third Joint WOSP/SIPEW International Conference on Performance Engineering, ICPE'12, Boston, MA, April 2012
@inproceedings{abc,
	author = {Ioana Giurgiu},
	booktitle = {Third Joint WOSP/SIPEW International Conference on Performance Engineering, ICPE{\textquoteright}12, Boston, MA},
	title = {Understanding performance modeling for modular mobile-cloud applications.},
	url = {http://doi.acm.org/10.1145/2188286.2188333},
	year = {2012}
}
2012 IEEE 20th Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2012, Toronto, Ontario, Canada, April 2012
@inproceedings{abc,
	author = {Louis Woods and Ken Eguro},
	booktitle = {2012 IEEE 20th Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2012},
	title = {Groundhog - A Serial ATA Host Bus Adapter (HBA) for FPGAs.},
	url = {http://doi.ieeecomputersociety.org/10.1109/FCCM.2012.45},
	venue = {Toronto, Ontario, Canada},
	year = {2012}
}
Proceedings of the First International Workshop on Crowdsourcing Web Search, Lyon, France, April 2012
@inproceedings{abc,
	author = {Donald Kossmann},
	booktitle = {Proceedings of the First International Workshop on Crowdsourcing Web Search, Lyon, France},
	title = {Using the Crowd to Solve Database Problems.},
	url = {http://ceur-ws.org/Vol-842/crowdsearch-kossman.pdf},
	year = {2012}
}
Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012, San Jose, CA, USA, April 2012
@inproceedings{abc,
	author = {Ankit Singla and Chi-Yao Hong and Lucian Popa and Brighten Godfrey},
	booktitle = {Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012, San Jose, CA, USA},
	title = {Jellyfish: Networking Data Centers Randomly.},
	url = {https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/singla},
	year = {2012}
}
Systems Group Master's Thesis, no. 41; Department of Computer Science, April 2012
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Lucas Braun},
	school = {41},
	title = {Privacy in Pub/Sub Systems},
	year = {2012}
}
IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), April 2012
@inproceedings{abc,
	author = {Peter M. Fischer and Jens Teubner},
	booktitle = {IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia)},
	title = {MXQuery with Hardware Acceleration.},
	url = {http://doi.ieeecomputersociety.org/10.1109/ICDE.2012.130},
	year = {2012}
}
Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012, San Jose, CA, USA, April 2012
@inproceedings{abc,
	author = {Ankit Singla and Atul Singh and Yan Chen},
	booktitle = {Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012, San Jose, CA, USA},
	title = {OSA: An Optical Switching Architecture for Data Center Networks with Unprecedented Flexibility.},
	url = {https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/chen_kai},
	year = {2012}
}
Systems Group Master's Thesis, no. 39; Department of Computer Science, April 2012
@mastersthesis{abc,
	author = {Raphael Tawil},
	school = {39},
	title = {Eliminating Insecure Uses of C Library Functions},
	year = {2012}
}
ETH Zürich, Diss. Nr. 20609, April 2012
Supervised by: Prof. Donald Kossmann
Data analysis is a large field which has multiple facets and encompasses diverse techniques in a variety of domains. This thesis looks at two problems from the data analysis field: Keyword search on data warehouses and indexing of moving objects.
@phdthesis{abc,
	abstract = {Data analysis is a large field which has multiple facets and encompasses diverse techniques in a variety of domains. This thesis looks at two problems from the data analysis field: Keyword search on data warehouses and indexing of moving objects.},
	author = {Lukas Blunschi},
	school = {20609},
	title = {Indexing and Search on Complex Data Warehouses and Rapidly-Changing Data},
	year = {2012}
}
ETH Zürich, Diss. Nr. 20272, March 2012
Supervised by: Prof. Donald Kossmann
@phdthesis{abc,
	author = {Philipp Unterbrunner},
	school = {20272},
	title = {Elastic, Reliable, and Robust Storage and Query Processing with Crescando/RB},
	year = {2012}
}
Systems Group Master's Thesis, no. 35; Department of Computer Science, March 2012
Supervised by: Prof. Timothy Roscoe
@mastersthesis{abc,
	author = {Gerd Zellweger},
	school = {35},
	title = {Unifying Synchronization and Events in a Multicore Operating System},
	year = {2012}
}
Systems Group Master's Thesis, no. 49; Department of Computer Science, March 2012
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Vincent Martinez},
	school = {49},
	title = {Flight Delay Prediction},
	year = {2012}
}
ETH Zürich, Diss. Nr. 20314, March 2012
Supervised by: Prof. Donald Kossmann
@phdthesis{abc,
	author = {Stefan Hildenbrand},
	school = {20314},
	title = {Scaling Out Column Stores: Data, Queries, and Transactions},
	year = {2012}
}
ETH Zürich, Diss. Nr. 20295, March 2012
Supervised by: Prof. Donald Kossmann
@phdthesis{abc,
	author = {Martin Hentschel},
	school = {20295},
	title = {Scalable systems for data analytics and integration},
	year = {2012}
}
Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2012, London, UK, March 2012
@inproceedings{abc,
	author = {Jos{\'e} A. Joao and M. Aater Suleman and Onur Mutlu and Yale N. Patt},
	booktitle = {Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2012, London, UK},
	title = {Bottleneck identification and scheduling in multithreaded applications.},
	url = {http://doi.acm.org/10.1145/2150976.2151001},
	year = {2012}
}
2012 Design, Automation & Test in Europe Conference & Exhibition, DATE 2012, Dresden, Germany, March 2012
@inproceedings{abc,
	author = {Yu Cai and Erich F. Haratsch and Onur Mutlu and Ken Mai},
	booktitle = {2012 Design, Automation \& Test in Europe Conference \& Exhibition, DATE 2012, Dresden, Germany},
	title = {Error patterns in MLC NAND flash memory: Measurement, characterization, and analysis.},
	url = {http://dx.doi.org/10.1109/DATE.2012.6176524},
	year = {2012}
}
Systems Group Master's Thesis, no. 38; Department of Computer Science, February 2012
Supervised by: Prof. Donald Kossmann
Databases generally adhere to the "closed-world" assumption. If data is not in the database, the database treats it as non-existent. This model works well for things like financial data or inventory. For other data types such as addresses, the data may exist but not be in the database. New systems called crowd-sourced databases now assume an "open-world" and allow a schema to contain columns or even entire tables that are filled with information that is crowd-sourced. Crowd-sourcing relations between entities is the next step in this development. This allows joins and orderings of data that is difficult to compare computationally but easily compared by humans. This master thesis investigates data-structures and algorithms to make the most out of crowd-sourced relations by exploiting the transitivity inherent to the equality and order relations. Along the way, ambiguities in the data have to be tolerated and resolved. After all, humans are far from perfect and so is the data that crowd-sourcing provides.
@mastersthesis{abc,
	abstract = {Databases generally adhere to the ``closed-world'' assumption. If data is not in the database, the database treats it as non-existent. This model works well for things like financial data or inventory. For other data types such as addresses, the data may exist but not be in the database. New systems called crowd-sourced databases now assume an ``open-world'' and allow a schema to contain columns or even entire tables that are filled with information that is crowd-sourced.
Crowd-sourcing relations between entities is the next step in this development. This allows joins and orderings of data that is difficult to compare computationally but easily compared by humans. This master thesis investigates data-structures and algorithms to make the most out of crowd-sourced relations by exploiting the transitivity inherent to the equality and order relations. Along the way, ambiguities in the data have to be tolerated and resolved. After all, humans are far from perfect and so is the data that crowd-sourcing provides.},
	author = {Florian Widmer},
	school = {38},
	title = {Memoization of Crowd-sourced Comparisons},
	year = {2012}
}
Systems Group Master's Thesis, no. 32; Department of Computer Science, February 2012
Supervised by: Prof. Donald Kossmann
In today’s big scale applications, the database system has more and more evolved to the most limiting bottleneck. Unlike application servers and web servers, database systems are hard to scale. During the last few years a new generation of simpler but more scalable storage systems, often refered to as NoSQL (Not only SQL), recieved more attention. While these system often solve scalability issues, they sacrifice some fundamantal properties of traditional database systems like consistency or the data model. This thesis presents a relational and transactional database system which uses a key-value store instead of a hard drive disk as storage. This systems provides better elasticity and scalability than traditional disk based architecures without making any compromises in the consistency guarantees. In the first part we show that a transactional database can run on top of a key-value store without any performance or scalability penalties. We implemented such a system based on MySQL and RamCloud and provide benchmarking results of this system. In a next step we explain how this system can be made scalable and how the database system needs to be changed in order to be able to run several database instances on the same storage.
@mastersthesis{abc,
	abstract = {In today{\textquoteright}s large-scale applications, the database system has increasingly become the most limiting bottleneck. Unlike application servers and web servers, database systems are hard to scale. During the last few years, a new generation of simpler but more scalable storage systems, often referred to as NoSQL (Not only SQL), has received more attention. While these systems often solve scalability issues, they sacrifice some fundamental properties of traditional database systems, such as consistency or the data model.
This thesis presents a relational and transactional database system which uses a key-value store instead of a hard disk drive as storage. This system provides better elasticity and scalability than traditional disk-based architectures without making any compromises in the consistency guarantees. In the first part, we show that a transactional database can run on top of a key-value store without any performance or scalability penalties. We implemented such a system based on MySQL and RamCloud and provide benchmarking results for it. In a next step, we explain how this system can be made scalable and how the database system needs to be changed in order to run several database instances on the same storage.},
	author = {Markus Pilman},
	school = {32},
	title = {Running a transactional Database on top of RamCloud},
	year = {2012}
}
Workshops Proceedings of the IEEE 28th International Conference on Data Engineering, ICDE 2012, Arlington, VA, USA, January 2012
@inproceedings{abc,
	author = {Romeo Kienzler and R{\'e}my Bruggmann and Anand Ranganathan and Nesime Tatbul},
	booktitle = {Workshops Proceedings of the IEEE 28th International Conference on Data Engineering, ICDE 2012, Arlington, VA, USA},
	title = {Stream As You Go: The Case for Incremental Data Access and Processing in the Cloud},
	url = {http://doi.ieeecomputersociety.org/10.1109/ICDEW.2012.69},
	year = {2012}
}
IEEE Data Eng. Bull., January 2012
@inproceedings{abc,
	author = {Arvind Arasu and Spyros Blanas and Ken Eguro and Manas Joglekar and Raghav Kaushik and Donald Kossmann and Ravishankar Ramamurthy and Prasang Upadhyaya and Ramarathnam Venkatesan},
	booktitle = {IEEE Data Eng. Bull.},
	title = {Engineering Security and Performance with Cipherbase.},
	url = {http://sites.computer.org/debull/A12dec/cipher.pdf},
	year = {2012}
}
PVLDB, January 2012
@article{abc,
	author = {Georgios Giannikis and Gustavo Alonso and Donald Kossmann},
	journal = {PVLDB},
	title = {SharedDB: Killing One Thousand Queries With One Stone},
	url = {http://vldb.org/pvldb/vol5/p526_georgiosgiannikis_vldb2012.pdf},
	year = {2012}
}
January 2012
@techreport{abc,
	author = {Boris Glavic and Kyumars Sheykh Esmaili and Peter M. Fischer and Nesime Tatbul},
	title = {Ariadne: Managing Fine-Grained Provenance on Data Streams},
	year = {2012}
}
CoRR, January 2012
@inproceedings{abc,
	author = {Joan Feigenbaum and Brighten Godfrey and Aurojit Panda and Michael Schapira and Scott Shenker and Ankit Singla},
	booktitle = {CoRR},
	title = {On the Resilience of Routing Tables},
	url = {http://arxiv.org/abs/1207.3732},
	year = {2012}
}
January 2012
@techreport{abc,
	author = {Simon Peter and Rebecca Isaacs and Paul Barham and Richard Black and Timothy Roscoe},
	title = {Efficient data-parallel computing on small heterogeneous clusters},
	year = {2012}
}
Proceedings of the 3rd Asia-Pacific Workshop on Systems (APSys '12), Seoul, South Korea, January 2012
@inproceedings{abc,
	author = {Gerd Zellweger and Adrian Sch{\"u}pbach and Timothy Roscoe},
	booktitle = {Proceedings of the 3rd Asia-Pacific Workshop on Systems (APSys {\textquoteright}12), Seoul, South Korea},
	title = {Unifying Synchronization and Events in a Multicore OS},
	year = {2012}
}
January 2012
@techreport{abc,
	author = {Tahmineh Sanamrad and Daniel Widmer and Donald Kossmann},
	title = {My Private Google Calendar},
	year = {2012}
}
CoRR, January 2012
@inproceedings{abc,
	author = {Lukas Blunschi and Claudio Jossen and Donald Kossmann and Magdalini Mori and Kurt Stockinger},
	booktitle = {CoRR},
	title = {SODA: Generating SQL for Business Users},
	url = {http://arxiv.org/abs/1207.0134},
	year = {2012}
}
January 2012
@techreport{abc,
	author = {Cagri Balkesen and Jens Thilo Teubner and Gustavo Alonso and M. Tamer {\"O}zsu},
	title = {Main-Memory Hash Joins on Multi-Core CPUs: Tuning to the Underlying Hardware},
	year = {2012}
}
IEEE Micro, January 2012
@inproceedings{abc,
	author = {Boris Grot and Joel Hestness and Stephen W. Keckler and Onur Mutlu},
	booktitle = {IEEE Micro},
	title = {A QoS-Enabled On-Die Interconnect Fabric for Kilo-Node Chips.},
	url = {http://doi.ieeecomputersociety.org/10.1109/MM.2012.18},
	year = {2012}
}
Int. J. Semantic Web Inf. Syst., January 2012
@inproceedings{abc,
	author = {Feng Niu and Ce Zhang and Christopher R{\'e} and Jude W. Shavlik},
	booktitle = {Int. J. Semantic Web Inf. Syst.},
	title = {Elementary: Large-Scale Knowledge-Base Construction via Machine Learning and Statistical Inference.},
	url = {http://dx.doi.org/10.4018/jswis.2012070103},
	year = {2012}
}
Proceedings of the 8th International ICST Conference on Testbeds and Research Infrastructures for the Development of Networks and Communities (TridentCom 2012), Thessaloniki, Greece, January 2012
@inproceedings{abc,
	author = {Qin Yin and Timothy Roscoe},
	booktitle = {Proceedings of the 8th International ICST Conference on Testbeds and Research Infrastructures for the Development of Networks and Communities (TridentCom 2012), Thessaloniki, Greece},
	title = {VF2x: Fast, efficient virtual network mapping for real testbed workloads},
	year = {2012}
}
January 2012
@techreport{abc,
	author = {Anja Gruenheid and Donald Kossmann and Sukriti Ramesh and Florian Widmer},
	title = {Crowdsourcing Entity Resolution: When is A=B?},
	year = {2012}
}
15th International Conference on Extending Database Technology, EDBT '12, Berlin, Germany, January 2012
@inproceedings{abc,
	author = {Irina Botan and Peter M. Fischer and Donald Kossmann and Nesime Tatbul},
	booktitle = {15th International Conference on Extending Database Technology, EDBT {\textquoteright}12, Berlin, Germany},
	title = {Transactional Stream Processing},
	url = {http://doi.acm.org/10.1145/2247596.2247622},
	year = {2012}
}
ACM Trans. Comput. Syst., January 2012
@article{abc,
	author = {Adrian Sch{\"u}pbach and Andrew Baumann and Timothy Roscoe and Simon Peter},
	journal = {ACM Trans. Comput. Syst.},
	title = {A Declarative Language Approach to Device Configuration.},
	url = {http://doi.acm.org/10.1145/2110356.2110361},
	year = {2012}
}
Computer Architecture Letters, January 2012
@inproceedings{abc,
	author = {Justin Meza and Jichuan Chang and HanBin Yoon and Onur Mutlu and Parthasarathy Ranganathan},
	booktitle = {Computer Architecture Letters},
	title = {Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management.},
	url = {http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.2},
	year = {2012}
}
PVLDB, January 2012
@inproceedings{abc,
	author = {Ahmet Sacan and Nesime Tatbul},
	booktitle = {PVLDB},
	title = {Letter from the Associate Editors.},
	url = {http://vldb.org/pvldb/vol5/frontmatterVol5No10.pdf},
	year = {2012}
}
PVLDB, January 2012
@inproceedings{abc,
	author = {Lukas Blunschi and Claudio Jossen and Donald Kossmann and Magdalini Mori and Kurt Stockinger},
	booktitle = {PVLDB},
	title = {SODA: Generating SQL for Business Users},
	url = {http://vldb.org/pvldb/vol5/p932_lukasblunschi_vldb2012.pdf},
	year = {2012}
}
Proceedings of the 2012 Joint EDBT/ICDT Workshops, Berlin, Germany, January 2012
@inproceedings{abc,
	author = {Simon Loesing and Martin Hentschel and Tim Kraska and Donald Kossmann},
	booktitle = {Proceedings of the 2012 Joint EDBT/ICDT Workshops, Berlin, Germany},
	title = {Stormy: An Elastic and Highly Available Streaming Service in the Cloud},
	url = {http://doi.acm.org/10.1145/2320765.2320789},
	year = {2012}
}
PVLDB, January 2012
@inproceedings{abc,
	author = {Gustavo Alonso and Juliana Freire},
	booktitle = {PVLDB},
	title = {Letter from the Associate Editors.},
	url = {http://vldb.org/pvldb/vol5/frontmatterVol5No7.pdf},
	year = {2012}
}
IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), January 2012
@inproceedings{abc,
	author = {Claudio Jossen and Lukas Blunschi and Magdalini Mori and Donald Kossmann and Kurt Stockinger},
	booktitle = {IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia)},
	title = {The Credit Suisse Meta-data Warehouse},
	url = {http://doi.ieeecomputersociety.org/10.1109/ICDE.2012.41},
	year = {2012}
}
CoRR, January 2012
@article{abc,
	author = {Georgios Giannikis and Gustavo Alonso and Donald Kossmann},
	journal = {CoRR},
	title = {SharedDB: Killing One Thousand Queries With One Stone},
	url = {http://arxiv.org/abs/1203.0056},
	year = {2012}
}
Systems Group Master's Thesis, no. 40; Department of Computer Science, ETH Zurich, January 2012
Supervised by: Prof. Nesime Tatbul
@mastersthesis{abc,
	author = {Romeo Kienzler},
	school = {40},
	title = {A Stream-based Approach to Massively Parallel DNA Sequence Analysis in the Cloud},
	year = {2012}
}
January 2012
@techreport{abc,
	author = {Gustavo Alonso and Donald Kossmann and Tudor-Ioan Salomie and Andreas Schmidt},
	title = {Shared Scans on Main Memory Column Stores},
	year = {2012}
}
Proceedings of the 2nd workshop on Systems for Future Multi-core Architectures (SFMA'12), Bern, Switzerland, January 2012
@inproceedings{abc,
	author = {Jana Giceva and Adrian Sch{\"u}pbach and Gustavo Alonso and Timothy Roscoe},
	booktitle = {Proceedings of the 2nd workshop on Systems for Future Multi-core Architectures (SFMA{\textquoteright}12), Bern, Switzerland},
	title = {Towards Database / Operating System Co-Design},
	year = {2012}
}
ACM Trans. Comput. Syst., January 2012
@inproceedings{abc,
	author = {Eiman Ebrahimi and Chang Joo Lee and Onur Mutlu and Yale N. Patt},
	booktitle = {ACM Trans. Comput. Syst.},
	title = {Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multicore Memory Systems.},
	url = {http://doi.acm.org/10.1145/2166879.2166881},
	year = {2012}
}
IEEE Data Eng. Bull., January 2012
@inproceedings{abc,
	author = {Tahmineh Sanamrad and Patrick Nick and Daniel Widmer and Donald Kossmann and Lucas Braun},
	booktitle = {IEEE Data Eng. Bull.},
	title = {My Private Google Calendar and GMail.},
	url = {http://sites.computer.org/debull/A12dec/my-private.pdf},
	year = {2012}
}

2011

Systems Group Master's Thesis, no. 16; Department of Computer Science, December 2011
Supervised by: Prof. Gustavo Alonso
@mastersthesis{abc,
	author = {Ueli Ehrbar},
	school = {16},
	title = {Event Consolidation and Analysis Tool for E-Banking},
	year = {2011}
}
44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2011, Porto Alegre, Brazil, December 2011
@inproceedings{abc,
	author = {Veynu Narasiman and Michael Shebanow and Chang Joo Lee and Rustam Miftakhutdinov and Onur Mutlu and Yale N. Patt},
	booktitle = {44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2011, Porto Alegre, Brazil},
	title = {Improving GPU performance via large warps and two-level warp scheduling.},
	url = {http://doi.acm.org/10.1145/2155620.2155656},
	year = {2011}
}
44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2011, Porto Alegre, Brazil, December 2011
@inproceedings{abc,
	author = {Sai Prashanth Muralidhara and Lavanya Subramanian and Onur Mutlu and Mahmut T. Kandemir and Thomas Moscibroda},
	booktitle = {44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2011, Porto Alegre, Brazil},
	title = {Reducing memory interference in multicore systems via application-aware memory channel partitioning.},
	url = {http://doi.acm.org/10.1145/2155620.2155664},
	year = {2011}
}
44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2011, Porto Alegre, Brazil, December 2011
@inproceedings{abc,
	author = {Eiman Ebrahimi and Rustam Miftakhutdinov and Chris Fallin and Chang Joo Lee and Jos{\'e} A. Joao and Onur Mutlu and Yale N. Patt},
	booktitle = {44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2011, Porto Alegre, Brazil},
	title = {Parallel application memory scheduling.},
	url = {http://doi.acm.org/10.1145/2155620.2155663},
	year = {2011}
}
ETH Zürich, Diss. Nr. 20166, December 2011
Supervised by: Prof. Donald Kossmann
@phdthesis{abc,
	author = {Ghislain Fourny},
	school = {20166},
	title = {Flexible Models for Programming the Web},
	year = {2011}
}
ETH Zürich, Diss. Nr. 20172, December 2011
Supervised by: Prof. Gustavo Alonso
@phdthesis{abc,
	author = {Michael Duller},
	school = {20172},
	title = {Management and Federation of Stream Processing Applications},
	year = {2011}
}
Systems Group Master's Thesis, no. 30; Department of Computer Science, December 2011
Supervised by: Prof. Donald Kossmann
This master thesis aims to add partial live migration features to Crescando[17]. Crescando is a scalable, distributed relational table implementation based on parallel, collaborative scans in memory. The features developed in this thesis provide the building blocks for implementing elastic scalability and high availability in Crescando. Elastic scalability refers to adding storage nodes to or removing them from a cluster without downtime. High availability refers to avoiding unplanned outages by eliminating single points of failure. One of the methodologies for providing high availability is fault tolerance by replication[10]. Both elastic scalability and high availability require an efficient method to copy or move data across storage nodes, which this master thesis provides. The problem is tackled with a black-box approach: Crescando’s external user interface is used to solve the problem, rather than altering its implementation. Crescando’s simple operations (Select, Insert, Delete) are used as the elementary units to provide the functionality of copying and moving data across nodes. One of the challenges of building such a system is migrating the contents of a relational table with minimal impact on overall system availability and performance. Optimizations are incorporated to achieve efficient data transfer such that the data transfer rate saturates a gigabit Ethernet interface. The duration of the system interruption is limited to the period required for data transfer. Moreover, the solution must provide certain consistency guarantees. Our solution guarantees linearizability[9], a well-known strong consistency guarantee. The migration system developed in this thesis is employed by a higher-level layer known as Rubberband[16]. Rubberband implements a well-known replication scheme, successor-list replication[14]. Rubberband instructs the appropriate nodes in a dynamic set of nodes to shuffle data using the migration system developed in this thesis. They are instructed to shuffle data in order to maintain the successor-list replication scheme as storage nodes join and leave the system.
@mastersthesis{abc,
	abstract = {This master thesis aims to add partial live migration features to Crescando[17]. Crescando is a scalable, distributed relational table implementation based on parallel, collaborative scans in memory. The features developed in this thesis provide the building blocks for implementing elastic scalability and high availability in Crescando. Elastic scalability refers to adding storage nodes to or removing them from a cluster without downtime. High availability refers to avoiding unplanned outages by eliminating single points of failure. One of the methodologies for providing high availability is fault tolerance by replication[10]. Both elastic scalability and high availability require an efficient method to copy or move data across storage nodes, which this master thesis provides.
The problem is tackled with a black-box approach: Crescando{\textquoteright}s external user interface is used to solve the problem, rather than altering its implementation. Crescando{\textquoteright}s simple operations (Select, Insert, Delete) are used as the elementary units to provide the functionality of copying and moving data across nodes. One of the challenges of building such a system is migrating the contents of a relational table with minimal impact on overall system availability and performance. Optimizations are incorporated to achieve efficient data transfer such that the data transfer rate saturates a gigabit Ethernet interface. The duration of the system interruption is limited to the period required for data transfer. Moreover, the solution must provide certain consistency guarantees. Our solution guarantees linearizability[9], a well-known strong consistency guarantee.
The migration system developed in this thesis is employed by a higher-level layer known as Rubberband[16]. Rubberband implements a well-known replication scheme, successor-list replication[14]. Rubberband instructs the appropriate nodes in a dynamic set of nodes to shuffle data using the migration system developed in this thesis. They are instructed to shuffle data in order to maintain the successor-list replication scheme as storage nodes join and leave the system.},
	author = {Khalid Ashmawy},
	school = {30},
	title = {Partial live migration in scan-based database systems},
	year = {2011}
}
Proceedings of the 2011 ACM SIGSPATIAL International Workshop on GeoStreaming, IWGS 2011, Chicago, IL, USA, November 2011
@inproceedings{abc,
	author = {Asli {\"O}zal and Anand Ranganathan and Nesime Tatbul},
	booktitle = {Proceedings of the 2011 ACM SIGSPATIAL International Workshop on GeoStreaming, IWGS 2011},
	title = {Real-time route planning with stream processing systems: a case study for the city of Lucerne.},
	url = {http://doi.acm.org/10.1145/2064959.2064965},
	venue = {Chicago, IL, USA},
	year = {2011}
}
Tenth ACM Workshop on Hot Topics in Networks (HotNets-X), HOTNETS '11, Cambridge, MA, November 2011
@inproceedings{abc,
	author = {Ali Ghodsi and Scott Shenker and Teemu Koponen and Ankit Singla and Barath Raghavan and James R. Wilcox},
	booktitle = {Tenth ACM Workshop on Hot Topics in Networks (HotNets-X), HOTNETS {\textquoteright}11, Cambridge, MA},
	title = {Information-centric networking: seeing the forest for the trees.},
	url = {http://doi.acm.org/10.1145/2070562.2070563},
	year = {2011}
}
Tenth ACM Workshop on Hot Topics in Networks (HotNets-X), HOTNETS '11, Cambridge, MA, November 2011
@inproceedings{abc,
	author = {Ali Ghodsi and Scott Shenker and Teemu Koponen and Ankit Singla and Barath Raghavan and James R. Wilcox},
	booktitle = {Tenth ACM Workshop on Hot Topics in Networks (HotNets-X), HOTNETS {\textquoteright}11, Cambridge, MA},
	title = {Intelligent design enables architectural evolution.},
	url = {http://doi.acm.org/10.1145/2070562.2070565},
	year = {2011}
}
Proceedings of the 23rd ACM Symposium on Operating Systems Principles 2011, SOSP 2011, Cascais, Portugal, October 2011
@inproceedings{abc,
	author = {{\'U}lfar Erlingsson and Marcus Peinado and Simon Peter and Mihai Budiu},
	booktitle = {Proceedings of the 23rd ACM Symposium on Operating Systems Principles 2011, SOSP 2011, Cascais, Portugal},
	title = {Fay: extensible distributed tracing from kernels to clusters.},
	url = {http://doi.acm.org/10.1145/2043556.2043585},
	year = {2011}
}
Systems Group Master's Thesis, no. 31; Department of Computer Science, September 2011
Supervised by: Prof. Donald Kossmann
Keyword search systems have become very popular in the last decade because they allow people to access and find information in an easy and comfortable way. However, to keep the quality of these systems at a high level, system designers have to address several problems, such as changing user requirements or changes in the underlying data structure. We believe that user feedback is an elegant way to handle some of these problems and keep the system flexible and adaptable to future changes in the requirements. To prove this concept, this report explains the design and implementation of several user feedback features which are added to an existing keyword search system over data warehouses. We show by example how users can use these features and how they improve the usability and flexibility of the given system. At the end of this report, we present proposals for other feedback features and show how the implemented features could be further improved.
@mastersthesis{abc,
	abstract = {Keyword search systems have become very popular in the last decade because they allow people to access and find information in an easy and comfortable way. However, to keep the quality of these systems at a high level, system designers have to address several problems, such as changing user requirements or changes in the underlying data structure. We believe that user feedback is an elegant way to handle some of these problems and keep the system flexible and adaptable to future changes in the requirements. To prove this concept, this report explains the design and implementation of several user feedback features which are added to an existing keyword search system over data warehouses. We show by example how users can use these features and how they improve the usability and flexibility of the given system. At the end of this report, we present proposals for other feedback features and show how the implemented features could be further improved.},
	author = {Mike Klausmann},
	school = {31},
	title = {User Feedback Integration - Incremental Improvement},
	year = {2011}
}
Systems Group Master's Thesis, no. 27; Department of Computer Science, August 2011
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Willy Lai},
	school = {27},
	title = {Data Warehouse Query Log Analysis using MapReduce},
	year = {2011}
}
ETH Zürich, Diss. Nr. 19907, August 2011
Supervised by: Prof. Donald Kossmann
@phdthesis{abc,
	author = {Kyumars Sheykh Esmaili},
	school = {19907},
	title = {Data Stream Processing in Complex Applications},
	year = {2011}
}
Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, Revised Selected Papers, Part II, August 2011
@inproceedings{abc,
	author = {Romeo Kienzler and R{\'e}my Bruggmann and Anand Ranganathan and Nesime Tatbul},
	booktitle = {Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France},
	title = {Large-Scale DNA Sequence Analysis in the Cloud: A Stream-Based Approach.},
	url = {http://dx.doi.org/10.1007/978-3-642-29740-3_52},
	venue = {Revised Selected Papers, Part II},
	year = {2011}
}
Systems Group Master's Thesis, no. 28; Department of Computer Science, July 2011
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Shenji Sch{\"a}ppi},
	school = {28},
	title = {Evaluating Index Architectures in the Cloud},
	year = {2011}
}
Proceedings of the Fifth ACM International Conference on Distributed Event-Based Systems, DEBS 2011, New York, NY, USA, July 2011
@inproceedings{abc,
	author = {Nihal Dindar and Peter M. Fischer and Merve Soner and Nesime Tatbul},
	booktitle = {Proceedings of the Fifth ACM International Conference on Distributed Event-Based Systems, DEBS 2011, New York, NY, USA},
	title = {Efficiently correlating complex events over live and archived data streams.},
	url = {http://doi.acm.org/10.1145/2002259.2002293},
	year = {2011}
}
Systems Group Master's Thesis, no. 13; Department of Computer Science, June 2011
Supervised by: Prof. Nesime Tatbul
@mastersthesis{abc,
	author = {Burak Kalay},
	school = {13},
	title = {Tools and Techniques for Exploring Execution Model Relationships across Heterogeneous Stream Processing Engines},
	year = {2011}
}
Proceedings of the 8th International Conference on Autonomic Computing, ICAC 2011, Karlsruhe, Germany, June 2011
@inproceedings{abc,
	author = {Howard David and Chris Fallin and Eugene Gorbatov and Ulf R. Hanebutte and Onur Mutlu},
	booktitle = {Proceedings of the 8th International Conference on Autonomic Computing, ICAC 2011, Karlsruhe, Germany},
	title = {Memory power management via dynamic voltage/frequency scaling.},
	url = {http://doi.acm.org/10.1145/1998582.1998590},
	year = {2011}
}
Systems Group Master's Thesis, no. 12; Department of Computer Science, June 2011
Supervised by: Prof. Nesime Tatbul
@mastersthesis{abc,
	author = {Asli {\"O}zal},
	school = {12},
	title = {Real-Time Route Planning with Stream Processing Systems: A Case Study for the City of Luzern},
	year = {2011}
}
Proceedings of the 10th International Symposium on Memory Management, ISMM 2011, San Jose, CA, USA, June 2011
@inproceedings{abc,
	author = {Onur Mutlu},
	booktitle = {Proceedings of the 10th International Symposium on Memory Management, ISMM 2011, San Jose, CA, USA},
	title = {Memory systems in the many-core era: challenges, opportunities, and solution directions.},
	url = {http://doi.acm.org/10.1145/1993478.1993489},
	year = {2011}
}
Systems Group Master's Thesis, no. 29; Department of Computer Science, June 2011
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Daniel Thomas},
	school = {29},
	title = {Spam-Free Internet Search With SocialSearch},
	year = {2011}
}
Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 2011
@inproceedings{abc,
	author = {Kyumars Sheykh Esmaili and Tahmineh Sanamrad and Peter M. Fischer and Nesime Tatbul},
	booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece},
	title = {Changing flights in mid-air: a model for safely modifying continuous queries.},
	url = {http://doi.acm.org/10.1145/1989323.1989388},
	year = {2011}
}
Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 2011
@inproceedings{abc,
	author = {Michael J. Franklin and Donald Kossmann and Tim Kraska and Sukriti Ramesh and Reynold Xin},
	booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece},
	title = {CrowdDB: answering queries with crowdsourcing.},
	url = {http://doi.acm.org/10.1145/1989323.1989331},
	year = {2011}
}
Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 2011
@inproceedings{abc,
	author = {Jens Teubner and Ren{\'e} M{\"u}ller},
	booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece},
	title = {How soccer players would do stream joins.},
	url = {http://doi.acm.org/10.1145/1989323.1989389},
	year = {2011}
}
38th International Symposium on Computer Architecture (ISCA 2011), San Jose, CA, USA, June 2011
@inproceedings{abc,
	author = {Eiman Ebrahimi and Chang Joo Lee and Onur Mutlu and Yale N. Patt},
	booktitle = {38th International Symposium on Computer Architecture (ISCA 2011)},
	title = {Prefetch-aware shared resource management for multi-core systems.},
	url = {http://doi.acm.org/10.1145/2000064.2000081},
	venue = {San Jose, CA, USA},
	year = {2011}
}
38th International Symposium on Computer Architecture (ISCA 2011), San Jose, CA, USA, June 2011
@inproceedings{abc,
	author = {Boris Grot and Joel Hestness and Stephen W. Keckler and Onur Mutlu},
	booktitle = {38th International Symposium on Computer Architecture (ISCA 2011)},
	title = {Kilo-NOC: a heterogeneous network-on-chip architecture for scalability and service guarantees.},
	url = {http://doi.acm.org/10.1145/2000064.2000112},
	venue = {San Jose, CA, USA},
	year = {2011}
}
3rd USENIX Workshop on Hot Topics in Cloud Computing, HotCloud'11, Portland, OR, USA, June 2011
@inproceedings{abc,
	author = {Ankit Singla and Chi-Yao Hong and Lucian Popa and Brighten Godfrey},
	booktitle = {3rd USENIX Workshop on Hot Topics in Cloud Computing, HotCloud{\textquoteright}11, Portland, OR, USA},
	title = {Jellyfish: Networking Data Centers, Randomly.},
	url = {https://www.usenix.org/conference/hotcloud11/jellyfish-networking-data-centers-randomly},
	year = {2011}
}
Proceedings of the 19th International Workshop on Quality of Service, IWQoS 2011, San Jose, California, USA, 6-7 June 2011
@inproceedings{abc,
	author = {Ercan Ucan and Timothy Roscoe},
	booktitle = {19th International Workshop on Quality of Service, IWQoS 2011, San Jose, California, USA},
	title = {Dexferizer: A service for data transfer optimization.},
	url = {http://dx.doi.org/10.1109/IWQOS.2011.5931343},
	venue = {Proceedings of the 19th International Workshop on Quality of Service, IWQoS 2011, San Jose, California, USA, 6-7 June 2011},
	year = {2011}
}
Systems Group Master's Thesis, no. 14; Department of Computer Science, June 2011
Supervised by: Prof. Nesime Tatbul
@mastersthesis{abc,
	author = {Ozan Kaya},
	school = {14},
	title = {Performance Benchmarking of Stream Processing Architectures with Persistence and Enrichment Support},
	year = {2011}
}
NOCS 2011, Fifth ACM/IEEE International Symposium on Networks-on-Chip, Pittsburgh, Pennsylvania, USA, May 2011
@inproceedings{abc,
	author = {Michael Papamichael and James C. Hoe and Onur Mutlu},
	booktitle = {NOCS 2011, Fifth ACM/IEEE International Symposium on Networks-on-Chip, Pittsburgh, Pennsylvania, USA},
	title = {FIST: A fast, lightweight, FPGA-friendly packet latency estimator for NoC modeling in full-system simulations.},
	year = {2011}
}
Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 2011
@inproceedings{abc,
	author = {Licheng Chen and Yongbing Huang and Yungang Bao and Onur Mutlu and Guangming Tan and Mingyu Chen},
	booktitle = {Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA},
	title = {Poster: revisiting virtual channel memory for performance and fairness on multi-core architecture.},
	url = {http://doi.acm.org/10.1145/1995896.1995962},
	year = {2011}
}
Systems Group Master's Thesis, no. 8; Department of Computer Science, May 2011
Supervised by: Prof. Gustavo Alonso
@mastersthesis{abc,
	author = {Jana Giceva},
	school = {8},
	title = {Database-Operating System Co-Design},
	year = {2011}
}
Systems Group Master's Thesis, no. 11; Department of Computer Science, May 2011
Supervised by: Prof. Donald Kossmann
We introduce two approaches to augmenting English-Arabic statistical machine translation (SMT) with linguistic knowledge. The first approach improves SMT by adding linguistically motivated syntactic features to particular phrases. These added features are based on the English syntactic information, namely part-of-speech tags and dependency parse trees. We achieved improvements of 0.2 and 0.6 in BLEU score on two different data sets over the state-of-the-art SMT baseline system. The second approach improves morphological agreement in machine translation output through post-processing. Our method uses the projection of the English dependency parse tree onto the Arabic sentence in addition to the Arabic morphological analysis in order to extract the agreement relations between words in the Arabic sentence. Afterwards, classifiers for individual morphological features are trained using syntactic and morphological information from both the source and target languages. The predicted morphological features are then used to generate the correct surface forms. Our method achieves a statistically significant improvement over the baseline system according to human evaluation.
@mastersthesis{abc,
	abstract = {We introduce two approaches to augmenting English-Arabic statistical machine translation
(SMT) with linguistic knowledge. The first approach improves SMT by adding linguistically
motivated syntactic features to particular phrases. These added features are based on the English
syntactic information, namely part-of-speech tags and dependency parse trees. We achieved
improvements of 0.2 and 0.6 in BLEU score on two different data sets over the state-of-the-art
SMT baseline system. The second approach improves morphological agreement in machine
translation output through post-processing. Our method uses the projection of the English dependency
parse tree onto the Arabic sentence in addition to the Arabic morphological analysis
in order to extract the agreement relations between words in the Arabic sentence. Afterwards,
classifiers for individual morphological features are trained using syntactic and morphological
information from both the source and target languages. The predicted morphological features
are then used to generate the correct surface forms. Our method achieves a statistically significant
improvement over the baseline system according to human evaluation.},
	author = {Soha Sultan},
	school = {11},
	title = {Applying Morphology to English-Arabic Statistical Machine Translation},
	year = {2011}
}
Systems Group Master's Thesis, no. 10; Department of Computer Science, May 2011
Supervised by: Prof. Timothy Roscoe
@mastersthesis{abc,
	author = {Kaveh Razavi},
	school = {10},
	title = {Performance isolation on multicore hardware},
	year = {2011}
}
ETH Zürich, Diss. Nr. 19694, May 2011
Supervised by: Prof. Donald Kossmann
A variety of applications require low-latency processing of data that comes in highly dynamic streams of items. These applications are implemented using Data Stream Management Systems (DSMSs). More recently, new application domains like real-time business intelligence turned to the “on-the-fly” processing model employed by these systems for a solution to their challenges. As a result, the requirements imposed on the DSMSs have become more complex: e.g., mechanisms for correlating data streams with stored information or near-real time complex analysis of large portions of streaming data. In order to meet the evolving requirements of modern streaming applications, a clean, flexible and high performance DSMS design is required. Although many system implementations were proposed, none of them offers a clean, systematic approach to data storage management. Rather, the storage manager is usually tightly coupled with the continuous query execution engine. This design decision limits the possibility for further performance improvement and severely restricts the flexibility necessary to accommodate new application requirements. Moreover, today, there is no standard for querying streams and, as a result, each DSMS exposes its own execution semantics, making the implementation of the new requirements even more challenging. This dissertation investigates the design and implementation of a general-purpose storage management framework for Data Stream Management Systems, that we name SMS (Storage Manager for Streams). The ultimate goal of this framework is to provide a general, clean, flexible and high-performance storage management system which could be virtually “plugged” into any DSMS. In order to achieve this goal, in this work, we combine the experience gained over decades of research on Database Management Systems with the high-performance mechanisms employed by the Data Stream Management Systems.
Following the database systems architecture design, this framework is based on the principle of separating concerns: the query processor is decoupled from the storage manager. As such, the storage system obtains the flexibility necessary to accommodate new requirements, behind a general interface. Moreover, it can provide specialized store implementations tailored to the particular requirements of the applications, which is key to achieving good performance. In this respect, an important contribution of the framework is the reuse of the access patterns of the continuous query operators to tune the stores’ implementation and as such, to speed up the access on materialized data. In addition, the unified transactional model proposed in this dissertation makes minimal extensions to the traditional transactional model in order to accommodate streams and continuous queries. As a result, it offers a clean semantics for continuous query execution over arbitrary combinations of data sources (streaming and stored) in the presence of concurrent access and failures. And even more, it can be used to explain the transactional behavior of state-of-the-art DSMSs. A series of experiments are conducted using the Linear Road streaming benchmark’s implementation in MXQuery (a Java-based open-source XQuery engine, extended with window functions for continuous processing). MXQuery uses SMS for all its data storage related tasks. Our experiments show that the response time of the continuous queries can indeed be lowered if the store implementations are tuned according to the access patterns of the continuous query operators. Moreover, a transaction manager implementing the unified transactional model and designed as an additional component between the access and storage layers of SMS provides correctness and reliability for the Linear Road application with practically no performance penalty. 
As such, the experimental results indicate that a storage manager built on these ideas is a promising approach.
@phdthesis{abc,
	abstract = {A variety of applications require low-latency processing of data that comes in highly dynamic streams of items. These applications are implemented using Data Stream Management Systems (DSMSs). More recently, new application domains like real-time business intelligence turned to the {\textquotedblleft}on-the-fly{\textquotedblright} processing model employed by these systems for a solution to their challenges. As a result, the requirements imposed on the DSMSs have become more complex: e.g., mechanisms for correlating data streams with stored information or near-real time complex analysis of large portions of streaming data.
In order to meet the evolving requirements of modern streaming applications, a clean, flexible and high performance DSMS design is required. Although many system implementations were proposed, none of them offers a clean, systematic approach to data storage management. Rather, the storage manager is usually tightly coupled with the continuous query execution engine. This design decision limits the possibility for further performance improvement and severely restricts the flexibility necessary to accommodate new application requirements. Moreover, today, there is no standard for querying
streams and, as a result, each DSMS exposes its own execution semantics, making the implementation of the new requirements even more challenging.
This dissertation investigates the design and implementation of a general-purpose storage management framework for Data Stream Management Systems, that we name SMS (Storage Manager for Streams). The ultimate goal of this framework is to provide a general, clean, flexible and high-performance storage management system which could be virtually {\textquotedblleft}plugged{\textquotedblright} into any DSMS. In order to achieve this goal, in this work, we combine the experience gained over decades of research on Database Management Systems with the high-performance mechanisms employed by the Data Stream Management Systems.
Following the database systems architecture design, this framework is based on the principle of separating concerns: the query processor is decoupled from the storage manager. As such, the storage system obtains the flexibility necessary to accommodate new requirements, behind a general interface. Moreover, it can provide specialized store implementations tailored to the particular requirements of the applications, which is key to achieving good performance. In this respect, an important contribution of the framework is the reuse of the access patterns of the continuous query operators to tune the stores{\textquoteright} implementation and as such, to speed up the access on materialized data. In addition, the unified transactional model proposed in this dissertation makes minimal extensions to the traditional transactional model in order to accommodate streams and continuous queries. As a result, it offers a clean semantics for continuous query execution over arbitrary combinations of data sources (streaming and stored) in the presence of concurrent access and failures. And even more, it can be used to explain the transactional behavior of state-of-the-art DSMSs.
A series of experiments are conducted using the Linear Road streaming benchmark{\textquoteright}s implementation in MXQuery (a Java-based open-source XQuery engine, extended with window functions for continuous processing). MXQuery uses SMS for all its data storage related tasks. Our experiments show that the response time of the continuous queries can indeed be lowered if the store implementations are tuned according to the access patterns of the continuous query operators. Moreover, a transaction manager implementing the unified transactional model and designed as an additional component between the access and storage layers of SMS provides correctness and reliability for the Linear Road application with practically no performance penalty. As such, the experimental results indicate that a storage manager built on these ideas is a promising approach.},
	author = {Irina Botan},
	school = {19694},
	title = {Storage Management Techniques for Stream Processing},
	year = {2011}
}
Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, Hannover, Germany, April 2011
@inproceedings{abc,
	author = {Maximilian Ahrens and Gustavo Alonso},
	booktitle = {Proceedings of the 27th International Conference on Data Engineering, ICDE 2011},
	title = {Relational databases, virtualization, and the cloud.},
	url = {http://dx.doi.org/10.1109/ICDE.2011.5767966},
	venue = {Hannover, Germany},
	year = {2011}
}
Systems Group Master's Thesis, no. 2; Department of Computer Science, April 2011
Supervised by: Prof. Donald Kossmann
Despite the advances in the areas of databases and information retrieval, there still remain certain types of queries that are difficult to answer using machines alone. Such queries require human interaction to either provide data that is not readily available to machines or to gain more information from existing electronic data. CrowdDB is a database system that enables difficult queries to be answered by using crowdsourcing to integrate human knowledge with electronically available data. To a large extent, the concepts and capabilities of traditional database systems are leveraged in CrowdDB. Despite the commonalities, since CrowdDB deals with procuring and utilizing human input, several existing capabilities of traditional database systems require modifications and extensions. Much unlike electronically available data, human input provided by crowdsourcing is unbounded and virtually infinite. Accordingly, CrowdDB is a system based on an open-world assumption. An extension of SQL, termed CrowdSQL, is used to model data and manipulate it. CrowdSQL is also used as the language to express complex queries on the integrated data sources. Furthermore, interaction with the crowd in CrowdDB requires an additional component that governs automatic user interface generation, based on available schemas and queries. Also, performance acquires a new meaning in the context of a system such as CrowdDB. Response time (efficiency), quality (effectiveness) and cost (in $) in CrowdDB are dependent on a number of different parameters including the availability of the crowd, financial rewards for tasks and state of the crowdsourcing platform. In this thesis, we propose the design, architecture and functioning of CrowdDB. In addition, we present the details of building such a system on an existing Java-based database, H2. The design and functionalities of CrowdDB have also been presented in [13].
@mastersthesis{abc,
	abstract = {Despite the advances in the areas of databases and information retrieval, there still remain certain
types of queries that are difficult to answer using machines alone. Such queries require human interaction
to either provide data that is not readily available to machines or to gain more information from
existing electronic data.
CrowdDB is a database system that enables difficult queries to be answered by using crowdsourcing
to integrate human knowledge with electronically available data. To a large extent, the concepts
and capabilities of traditional database systems are leveraged in CrowdDB. Despite the commonalities,
since CrowdDB deals with procuring and utilizing human input, several existing capabilities of
traditional database systems require modifications and extensions. Much unlike electronically available
data, human input provided by crowdsourcing is unbounded and virtually infinite. Accordingly,
CrowdDB is a system based on an open-world assumption. An extension of SQL, termed CrowdSQL,
is used to model data and manipulate it. CrowdSQL is also used as the language to express
complex queries on the integrated data sources. Furthermore, interaction with the crowd in CrowdDB
requires an additional component that governs automatic user interface generation, based on available
schemas and queries. Also, performance acquires a new meaning in the context of a system such as
CrowdDB. Response time (efficiency), quality (effectiveness) and cost (in \$) in CrowdDB are dependent
on a number of different parameters including the availability of the crowd, financial rewards for
tasks and state of the crowdsourcing platform. In this thesis, we propose the design, architecture and
functioning of CrowdDB. In addition, we present the details of building such a system on an existing
Java-based database, H2. The design and functionalities of CrowdDB have also been presented in
[13].
},
	author = {Sukriti Ramesh},
	school = {2},
	title = {CrowdDB -- Answering Queries with Crowdsourcing},
	year = {2011}
}
Systems Group Master's Thesis, no. 7; Department of Computer Science, April 2011
Supervised by: Prof. Donald Kossmann
The web has become a real-time communication medium, used by a large number of people in ever-increasing parts of their daily lives. This new usage pattern gives advertisers and marketeers great opportunities to learn about their customers' temporal interests, thoughts and current context. Although it is a well-known fact that this information is extremely valuable for advertising and product recommendation, online advertising is adapting only slowly. This is mainly because it is not clear what information is valuable to use, because of the huge amount of data produced, and because of the lack of efficient models to process this data. This thesis describes an approach to implement Scalable Real-Time Product Recommendation based on Users Activity in a Social Network. The products are taken from Amazon.com and the social network used is the microblogging platform Twitter. It presents an implementation of this approach on top of the key-value database Cassandra, using a system called Triggy. Triggy extends Cassandra with incremental Map-Reduce tasks for push-style data processing. Use cases that require high-performance analysis of large amounts of data are the showpiece of every stream processing engine. These engines are built to process massive amounts of data in a very short time. Therefore, this thesis contains a comparison between four state-of-the-art distributed stream processing engines and the implementation with Triggy. It is shown that the analyzed use case has various properties that make its implementation in a stream processing engine impossible. Finally, a demo application is presented to show the described approach.
@mastersthesis{abc,
	abstract = {The web has become a real-time communication medium, used by a large
number of people in ever-increasing parts of their daily lives. This new usage
pattern gives advertisers and marketeers great opportunities to learn about
their customers{\textquoteright} temporal interests, thoughts and current context.
Although it is a well-known fact that this information is extremely valuable for
advertising and product recommendation, online advertising is adapting only
slowly. This is mainly because it is not clear what information is
valuable to use, because of the huge amount of data produced, and because of
the lack of efficient models to process this data.
This thesis describes an approach to implement Scalable Real-Time Product
Recommendation based on Users Activity in a Social Network. The products
are taken from Amazon.com and the social network used is the microblogging
platform Twitter. It presents an implementation of this approach on top of the
key-value database Cassandra, using a system called Triggy. Triggy extends
Cassandra with incremental Map-Reduce tasks for push-style data processing.
Use cases that require high-performance analysis of large amounts of data
are the showpiece of every stream processing engine. These engines are built to
process massive amounts of data in a very short time. Therefore, this thesis contains
a comparison between four state-of-the-art distributed stream processing
engines and the implementation with Triggy. It is shown that the analyzed use
case has various properties that make its implementation in a stream processing
engine impossible.
Finally, a demo application is presented to show the described approach.},
	author = {Michael Haspra},
	school = {7},
	title = {Scalable Real-Time Product Recommendation based on Users Activity in a Social Network},
	year = {2011}
}
Systems Group Master's Thesis, no. 6; Department of Computer Science, April 2011
Supervised by: Prof. Gustavo Alonso
This thesis presents a mechanical pattern-based transformation method for introducing an asynchronous communication mode in a legacy PL/I application running on the IMS platform. The method is presented as part of a more generally applicable framework of governing high-level solution concepts. Together, the solutions form a cost-flexible spectrum of holistic approaches ranging from a Synchronous Callout to a full asynchronous Request/Callback-based mode. The provided reengineering patterns consist of well-defined mechanical steps; hence the resulting cost is highly predictable. As a further benefit, both the patterns and solutions are mutually independent, thereby making the framework modular and in turn facilitating enhancement and replacement on the level of individual components. Work on this project was initiated as a result of newly arising circumstances in the IT infrastructure of Credit Suisse. For the first time, the mainframe is expected to serve as a client of cross-platform communication and as previous research has shown, synchronous communication may in many cases prove insufficient. The approach demonstrated in this work was designed to mitigate the corresponding deficiencies. It is expected to serve as a foundation for a complete, inexpensive reengineering solution with well-defined risks that will considerably ease the upcoming large-scale migration of legacy applications away from the mainframe.
@mastersthesis{abc,
	abstract = {This thesis presents a mechanical pattern-based transformation method for introducing an asynchronous communication mode in a legacy PL/I application running on the IMS platform. The method is presented as part of a more generally applicable framework of governing high-level solution concepts. Together, the solutions form a cost-flexible spectrum of holistic approaches ranging from a Synchronous Callout to a full asynchronous Request/Callback-based mode.
The provided reengineering patterns consist of well-defined mechanical steps; hence the resulting cost is highly predictable. As a further benefit, both the patterns and solutions are mutually independent, thereby making the framework modular and in turn facilitating enhancement and replacement on the level of individual components.
Work on this project was initiated as a result of newly arising circumstances in the IT infrastructure of Credit Suisse. For the first time, the mainframe is expected to serve as a client of cross-platform communication and as previous research has shown, synchronous communication may in many cases prove insufficient. The approach demonstrated in this work was designed to mitigate the corresponding deficiencies. It is expected to serve as a foundation for a complete, inexpensive reengineering solution with well-defined risks that will considerably ease the upcoming large-scale migration of legacy applications away from the mainframe.},
	author = {Lucia Ambrosova},
	school = {6},
	title = {A cost-flexible approach to transforming a legacy PL/I application to perform an asynchronous remote service call},
	year = {2011}
}
Database Systems for Advanced Applications - 16th International Conference, DASFAA 2011, Hong Kong, China, April 2011
@inproceedings{abc,
	author = {Junjie Yao and Bin Cui and Qiaosha Han and Ce Zhang and Yanhong Zhou},
	booktitle = {Database Systems for Advanced Applications - 16th International Conference, DASFAA 2011, Hong Kong, China},
	title = {Modeling User Expertise in Folksonomies by Fusing Multi-type Features.},
	url = {http://dx.doi.org/10.1007/978-3-642-20149-3_6},
	year = {2011}
}
Systems Group Master's Thesis, no. 1; Department of Computer Science, March 2011
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Andreas Morf},
	school = {1},
	title = {Snapshot Isolation in Distributed Column-Stores},
	year = {2011}
}
Systems Group Master's Thesis, no. 5; Department of Computer Science, March 2011
Supervised by: Prof. Donald Kossmann
@mastersthesis{abc,
	author = {Conrado Plano},
	school = {5},
	title = {Complex Query Processing with MapReduce in a Multi-Terabyte Logging Cluster},
	year = {2011}
}
ETH Zürich, Diss. Nr. 19610, March 2011
Supervised by: Prof. Gustavo Alonso
Computing systems are increasingly becoming dynamic. One example is cloud computing and its property of elasticity. On cloud platforms, resources in the form of additional nodes can be added and removed at any time. Software systems expected to run in such environments, on the other hand, are not nearly as elastic and flexible. Another example is the clearly visible trend towards incorporating an increasing number of processor cores into modern computer systems. In the future, it is likely that not all of these cores will have a uniform instruction set anymore but specialized units are used to accelerate certain tasks. At the same time, however, the power envelope of computer systems is increasingly becoming an issue so that probably not all cores can run at the same time anymore. Software written for such systems is thereby required to adapt to a changing pool of resources, a situation that today’s software is hardly prepared for. This thesis contributes towards understanding how to build software systems that are able to reflect such a degree of flexibility. The fundamental observation underlying this work is that in order to respond to the emerging dynamism in the platforms, software has to become equally flexible in its design. This requires a segregation of the software into smaller units. In the early days of computer science similar challenges in the development process of complex software have catalyzed the concept of software modularity. The premise of this thesis is that the same kind of modularization—when not only applied on a logical level to the source code but also in a physical form and preserved until runtime— results in the required degrees of freedom in the design of software systems. In combination with a smart runtime system, this approach turns software into flexible, fluid entities able to adapt to a dynamic environment. Three systems are presented in this thesis which are based upon the OSGi standard for dynamic mo