The DAMSLab (data management for data science laboratory) is a cross-organizational research group uniting the data management group of Graz University of Technology, Austria and the research area data management of the co-located Know-Center GmbH, Austria.
Our overall mission is to simplify data science by providing high-level, data science-centric abstractions, and by building systems and tools to execute these tasks in an efficient and scalable manner. To this end, our primary research focus is on ML systems internals for the end-to-end data science lifecycle (from data integration and preparation, through efficient model training, to debugging and serving), large-scale, distributed data management and analysis, as well as infrastructure and benchmarks for data and ML systems.
Open Thesis Topics
We're looking for motivated PhD, master's, and bachelor's students to join our team. Our research focuses on building ML systems and tools that simplify the data science lifecycle – from data integration, through model training, to deployment and scoring – via high-level language abstractions and specialized compiler and runtime techniques. If you're interested, please contact the project supervisor via email.
Selected Publications
- Johanna Sommer, Matthias Boehm, Alexandre V. Evfimievski, Berthold Reinwald, Peter J. Haas: MNC: Structure-Exploiting Sparsity Estimation for Matrix Expressions. SIGMOD 2019. [paper]
- Ahmed Elgohary, Matthias Boehm, Peter J. Haas, Frederick R. Reiss, Berthold Reinwald: Compressed Linear Algebra for Large-Scale Machine Learning. Commun. ACM 62(5), 2019. [paper, link]
- Matthias Boehm, Arun Kumar, Jun Yang: Data Management in Machine Learning Systems. Synthesis Lectures on Data Management 11 (1), Morgan & Claypool Publishers 2019. [book]
- Matthias Boehm, Alexandre V. Evfimievski, Berthold Reinwald: Efficient Data-Parallel Cumulative Aggregates for Large-Scale Machine Learning. BTW 2019. [paper, slides]
Overview: SystemDS is a versatile system for the end-to-end data science lifecycle, from data integration, cleaning, and feature engineering, through efficient local and distributed ML model training, to deployment and serving. To this end, we aim to provide a stack of declarative languages with R-like syntax for (1) the different tasks of the data science lifecycle, and (2) users with different levels of expertise. These high-level scripts are compiled into hybrid execution plans of local, in-memory CPU and GPU operations, as well as distributed operations on Apache Spark. In contrast to existing systems, which provide either homogeneous tensors or 2D datasets, and in order to serve the entire data science lifecycle, the underlying data model is DataTensors, i.e., tensors (multi-dimensional arrays) whose first dimension may have a heterogeneous and nested schema.
The project is hosted at github.com.
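To make the DataTensor idea above concrete, the following is a minimal illustrative sketch in Python, not the SystemDS implementation: the class name `DataTensor`, its fields, and the method `to_matrix` are hypothetical, chosen only to show how a schema-tagged, heterogeneous first dimension can coexist with homogeneous tensor data for ML operations.

```python
import numpy as np

# Hypothetical sketch of the DataTensor concept (not SystemDS code):
# the first dimension carries a heterogeneous, per-column schema,
# while selected columns can be cast into a homogeneous tensor
# for local or distributed ML operations.
class DataTensor:
    def __init__(self, columns, schema):
        # columns: dict of column name -> 1D array-like (mixed types allowed)
        # schema:  dict of column name -> type tag for the first dimension
        self.schema = schema
        self.columns = {name: np.asarray(vals) for name, vals in columns.items()}

    def to_matrix(self, numeric_cols):
        # Cast the selected numeric columns into one homogeneous
        # 2D float64 tensor, stacking them as columns.
        return np.stack(
            [self.columns[c].astype(np.float64) for c in numeric_cols],
            axis=1,
        )

# A tiny example: a string key column next to numeric feature columns.
dt = DataTensor(
    columns={"id": ["a", "b", "c"], "x1": [1, 2, 3], "x2": [0.5, 1.5, 2.5]},
    schema={"id": str, "x1": int, "x2": float},
)
X = dt.to_matrix(["x1", "x2"])
print(X.shape)  # (3, 2)
```

The design point this sketch illustrates is that feature engineering can operate on the heterogeneous view while training operates on the homogeneous tensor view, without copying data between two separate systems.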
Our research group is grateful for funding support from BMVIT,
TU Graz, AVL LIST, Infineon Technologies Austria, Magna Steyr
Fahrzeugtechnik, voestalpine Stahl Donawitz, and Know-Center.