Publications by Kaan Kara

2017

Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, August 2017
@inproceedings{zhang2017zipml,
	author = {Hantian Zhang and Jerry Li and Kaan Kara and Dan Alistarh and Ji Liu and Ce Zhang},
	booktitle = {Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia},
	title = {ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning.},
	url = {http://proceedings.mlr.press/v70/zhang17e.html},
	year = {2017}
}
Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 2017
Implementing parallel operators in multi-core machines often involves a data partitioning step that divides the data into cache-size blocks and arranges them so as to allow concurrent threads to process them in parallel. Data partitioning is expensive, in some cases accounting for up to 90% of the cost of, e.g., a parallel hash join. In this paper we explore the use of an FPGA to accelerate data partitioning. We do so in the context of new hybrid architectures where the FPGA is located as a co-processor residing on a socket and with coherent access to the same memory as the CPU residing on the other socket. Such an architecture reduces data transfer overheads between the CPU and the FPGA, enabling hybrid operator execution where the partitioning happens on the FPGA and the build and probe phases of a join happen on the CPU. Our experiments demonstrate that FPGA-based partitioning is significantly faster and more robust than CPU-based partitioning. The results open interesting options as FPGAs become more tightly integrated with the CPU.
@inproceedings{kara2017partitioning,
	abstract = {Implementing parallel operators in multi-core machines often involves a data partitioning step that divides the data into cache-size blocks and arranges them so as to allow concurrent threads to process them in parallel. Data partitioning is expensive, in some cases accounting for up to 90\% of the cost of, e.g., a parallel hash join. In this paper we explore the use of an FPGA to accelerate data partitioning. We do so in the context of new hybrid architectures where the FPGA is located as a co-processor residing on a socket and with coherent access to the same memory as the CPU residing on the other socket. Such an architecture reduces data transfer overheads between the CPU and the FPGA, enabling hybrid operator execution where the partitioning happens on the FPGA and the build and probe phases of a join happen on the CPU. Our experiments demonstrate that FPGA-based partitioning is significantly faster and more robust than CPU-based partitioning. The results open interesting options as FPGAs become more tightly integrated with the CPU.},
	author = {Kaan Kara and Jana Giceva and Gustavo Alonso},
	booktitle = {Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017},
	title = {FPGA-based Data Partitioning.},
	url = {http://doi.acm.org/10.1145/3035918.3035946},
	venue = {Chicago, IL, USA},
	year = {2017}
}
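As a rough software illustration of the partitioning step described in the abstract above, the sketch below hash-partitions (key, payload) tuples into a fixed number of buckets so that each bucket can later be built and probed by a single thread. The partition count, hash constant, and sample data are illustrative assumptions; the paper implements this step on the FPGA, not in software.

def hash_partition(tuples, num_partitions=64):
    """Assign (key, payload) tuples to buckets by hashing the key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, payload in tuples:
        # Multiplicative hash on 32-bit keys; the constant is an assumption,
        # not the hash function used by the paper's hardware.
        bucket = ((key * 2654435761) & 0xFFFFFFFF) % num_partitions
        partitions[bucket].append((key, payload))
    return partitions

if __name__ == "__main__":
    data = [(k, "payload-%d" % k) for k in range(100000)]
    parts = hash_partition(data)
    # Each bucket is now small enough (ideally cache-sized) for the CPU-side
    # build and probe phases of the join to process independently.
    print(len(parts), sum(len(p) for p in parts))
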
Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 2017
@inproceedings{sidler2017doppiodb,
	author = {David Sidler and Zsolt Istv{\'a}n and Muhsen Owaida and Kaan Kara and Gustavo Alonso},
	booktitle = {Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA},
	title = {doppioDB: A Hardware Accelerated Database.},
	url = {http://doi.acm.org/10.1145/3035918.3058746},
	year = {2017}
}
25th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2017, Napa, CA, USA, April 2017
@inproceedings{owaida2017centaur,
	author = {Muhsen Owaida and David Sidler and Kaan Kara and Gustavo Alonso},
	booktitle = {25th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2017, Napa, CA, USA},
	title = {Centaur: A Framework for Hybrid CPU-FPGA Databases.},
	url = {https://doi.org/10.1109/FCCM.2017.37},
	year = {2017}
}
25th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2017, Napa, CA, USA, April 2017
Stochastic gradient descent (SGD) is a commonly used algorithm for training linear machine learning models. Based on vector algebra, it benefits from the inherent parallelism available in an FPGA. In this paper, we first present a single-precision floating-point SGD implementation on an FPGA that provides performance similar to a 10-core CPU. We then adapt the design to make it capable of processing low-precision data. The low-precision data is obtained from a novel compression scheme called stochastic quantization, designed specifically for machine learning applications. We test both full-precision and low-precision designs on various regression and classification data sets. We achieve up to an order of magnitude training speedup when using low-precision data compared to a full-precision SGD on the same FPGA and a state-of-the-art multi-core solution, while maintaining the quality of training. We open source the designs presented in this paper.
@inproceedings{kara2017fpgasgd,
	abstract = {Stochastic gradient descent (SGD) is a commonly used algorithm for training linear machine learning models. Based on vector algebra, it benefits from the inherent parallelism available in an FPGA. In this paper, we first present a single-precision floating-point SGD implementation on an FPGA that provides performance similar to a 10-core CPU. We then adapt the design to make it capable of processing low-precision data. The low-precision data is obtained from a novel compression scheme called stochastic quantization, designed specifically for machine learning applications. We test both full-precision and low-precision designs on various regression and classification data sets. We achieve up to an order of magnitude training speedup when using low-precision data compared to a full-precision SGD on the same FPGA and a state-of-the-art multi-core solution, while maintaining the quality of training. We open source the designs presented in this paper.},
	author = {Kaan Kara and Dan Alistarh and Gustavo Alonso and Onur Mutlu and Ce Zhang},
	booktitle = {25th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2017, Napa, CA, USA},
	title = {FPGA-Accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-Off.},
	url = {https://doi.org/10.1109/FCCM.2017.39},
	venue = {Napa, CA, USA},
	year = {2017}
}
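For readers unfamiliar with the kernel being accelerated, here is a minimal, software-only sketch of the dense SGD update for a linear model, written to expose the vector operations (dot product and scaled vector update) that an FPGA can execute in parallel across features. The learning rate, data, and loop structure are illustrative assumptions; this is not the paper's FPGA design.

import numpy as np

def sgd_epoch(X, y, w, lr=0.05):
    """One pass of stochastic gradient descent for least-squares regression."""
    for i in range(X.shape[0]):
        xi = X[i]
        pred = xi @ w           # dot product: parallel across features on an FPGA
        err = pred - y[i]       # scalar prediction error
        w = w - lr * err * xi   # scaled vector update: also feature-parallel
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 16))
    w_true = rng.standard_normal(16)
    y = X @ w_true
    w = np.zeros(16)
    for _ in range(10):
        w = sgd_epoch(X, y, w)
    print(float(np.max(np.abs(w - w_true))))  # close to zero after a few passes
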
CoRR, May 2017
@article{guo2017layerwise,
	author = {Heng Guo and Kaan Kara and Ce Zhang},
	journal = {CoRR},
	title = {Layerwise Systematic Scan: Deep Boltzmann Machines and Beyond.},
	url = {http://arxiv.org/abs/1705.05154},
	year = {2017}
}

2016

26th International Conference on Field Programmable Logic and Applications, FPL 2016, Lausanne, Switzerland, August 2016
@inproceedings{kara2016hashing,
	author = {Kaan Kara and Gustavo Alonso},
	booktitle = {26th International Conference on Field Programmable Logic and Applications, FPL 2016, Lausanne, Switzerland},
	title = {Fast and robust hashing for database operators.},
	url = {http://dx.doi.org/10.1109/FPL.2016.7577353},
	year = {2016}
}
CoRR, November 2016
We present ZipML, the first framework for training dense generalized linear models using end-to-end low-precision representation: in ZipML, all movements of data, including those for input samples, model, and gradients, are represented using as little as two bits per component. Within our framework, we have successfully compressed, separately, the input data by 16x, the gradient by 16x, and the model by 16x while still getting the same training result. Even for the most challenging datasets, we find that robust convergence can be ensured using only an end-to-end 8-bit representation, or a 6-bit representation if only samples are quantized. Our work builds on previous research on using low-precision representations for the gradient and model in the context of stochastic gradient descent. Our main technical contribution is a new set of techniques which allow the training samples to be processed with low precision, without affecting the convergence of the algorithm. In turn, this leads to a system where all data items move in a quantized, low-precision format. In particular, we first establish that randomized rounding, while sufficient when quantizing the model and the gradients, is biased when quantizing samples, and thus leads to a different training result. We propose two new data representations which converge to the same solution as with the original data representation, both in theory and empirically, and require as little as two bits per component. As a result, if the original data is stored as 32-bit floats, we decrease the bandwidth footprint of each training iteration by up to 16x. Our results hold for models such as linear regression and least squares SVM. ZipML raises interesting theoretical questions related to the robustness of SGD to approximate data, model, and gradient representations. We conclude this working paper with a description of ongoing work extending these preliminary results.
@article{zhang2016zipml,
	abstract = {We present ZipML, the first framework for training dense generalized linear models using end-to-end low-precision representation: in ZipML, all movements of data, including those for input samples, model, and gradients, are represented using as little as two bits per component. Within our framework, we have successfully compressed, separately, the input data by 16x, the gradient by 16x, and the model by 16x while still getting the same training result. Even for the most challenging datasets, we find that robust convergence can be ensured using only an end-to-end 8-bit representation, or a 6-bit representation if only samples are quantized. Our work builds on previous research on using low-precision representations for the gradient and model in the context of stochastic gradient descent. Our main technical contribution is a new set of techniques which allow the training samples to be processed with low precision, without affecting the convergence of the algorithm. In turn, this leads to a system where all data items move in a quantized, low-precision format. In particular, we first establish that randomized rounding, while sufficient when quantizing the model and the gradients, is biased when quantizing samples, and thus leads to a different training result. We propose two new data representations which converge to the same solution as with the original data representation, both in theory and empirically, and require as little as two bits per component. As a result, if the original data is stored as 32-bit floats, we decrease the bandwidth footprint of each training iteration by up to 16x. Our results hold for models such as linear regression and least squares SVM. ZipML raises interesting theoretical questions related to the robustness of SGD to approximate data, model, and gradient representations. We conclude this working paper with a description of ongoing work extending these preliminary results.},
	author = {Hantian Zhang and Kaan Kara and Jerry Li and Dan Alistarh and Ji Liu and Ce Zhang},
	journal = {CoRR},
	title = {ZipML: An End-to-end Bitwise Framework for Dense Generalized Linear Models.},
	url = {http://arxiv.org/abs/1611.05402},
	year = {2016}
}
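To make the quantization idea concrete, here is a minimal sketch of stochastic (randomized) rounding: each value is rounded up or down at random so that the quantized value is unbiased in expectation. The bit width and value range are illustrative assumptions. The last two checks only hint at why quantizing samples is more delicate, as the abstract points out: squaring a single quantization overestimates x*x on average, while multiplying two independent quantizations does not. This is meant as intuition, not as a reconstruction of the paper's exact data representations.

import numpy as np

def stochastic_round(x, bits=2, lo=-1.0, hi=1.0):
    """Quantize x to 2**bits evenly spaced levels in [lo, hi], unbiased in expectation."""
    levels = 2**bits - 1
    scaled = (np.clip(x, lo, hi) - lo) / (hi - lo) * levels
    floor = np.floor(scaled)
    go_up = np.random.rand(*np.shape(x)) < (scaled - floor)  # round up with prob. equal to the remainder
    return (floor + go_up) / levels * (hi - lo) + lo

if __name__ == "__main__":
    x = np.random.default_rng(0).uniform(-1, 1, size=8)
    draws = 20000
    # Unbiased: the average of many quantizations recovers x.
    print(np.mean([stochastic_round(x) for _ in range(draws)], axis=0) - x)
    # Biased: the square of a single quantization overestimates x*x (by its variance).
    print(np.mean([stochastic_round(x) ** 2 for _ in range(draws)], axis=0) - x * x)
    # Unbiased again: the product of two independent quantizations averages to x*x.
    print(np.mean([stochastic_round(x) * stochastic_round(x) for _ in range(draws)], axis=0) - x * x)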