Open Thesis Topics in Data Management for Data Science @ Graz University of Technology

The following list shows selected open thesis topics that can be started immediately. However, our group is generally interested in better systems support for the entire end-to-end data science lifecycle and is thus always open to discussing additional topics.

BMW: Data Analytics in the Production of Lithium-Ion Battery Cells

Category: Internship (at BMW)

Abstract: The production of battery cells is a complex process chain characterized by unknown interactions that lead to quality fluctuations and product defects. The goal is to build up product and process expertise by selecting suitable data analytics methods to generate knowledge. For this purpose, you will examine production data of lithium-ion battery cells from a proprietary data lake to discover deeper insights, make predictions, or generate recommendations. In addition, you will develop an information system that supports decision-making at the operational level of lithium-ion battery cell production. Besides a steep learning curve, you can expect to work on a future-oriented and innovative topic area.

Details: English


AVL: Anomaly Detection on Test Beds

Category: Master (at AVL)

Abstract: The main goal of this thesis is to develop algorithms to detect anomalies on test beds. In order to monitor the condition of a unit under test (UUT) during a durability run, the measurement data needs to be analyzed. In this thesis, we want to focus on the analysis of one specific durability run on an engine test bed, which consists of periodically repeating cycles. How can we detect anomalies in such cyclic data?
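
As a rough illustration of the problem, the following Python sketch flags anomalous cycles by comparing each cycle against a robust reference cycle; the fixed cycle length, single measurement channel, and threshold are simplifying assumptions, not part of the actual AVL setup.

```python
# Sketch of cycle-wise anomaly detection; cycle_len and the single channel
# are simplifying assumptions for illustration only.
import numpy as np

def detect_anomalous_cycles(signal, cycle_len, z_thresh=3.5):
    """Flag cycles whose distance to the median cycle is a robust outlier."""
    n_cycles = len(signal) // cycle_len
    cycles = signal[:n_cycles * cycle_len].reshape(n_cycles, cycle_len)
    reference = np.median(cycles, axis=0)              # robust "normal" cycle
    dist = np.linalg.norm(cycles - reference, axis=1)  # deviation per cycle
    mad = np.median(np.abs(dist - np.median(dist))) + 1e-9
    z = 0.6745 * (dist - np.median(dist)) / mad        # robust z-score
    return np.where(z > z_thresh)[0]

# Example: 20 identical sine cycles with one distorted cycle injected
t = np.linspace(0, 2 * np.pi, 100, endpoint=False)
signal = np.tile(np.sin(t), 20)
signal[700:800] += 0.5
print(detect_anomalous_cycles(signal, cycle_len=100))  # -> [7]
```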

Details: English, German


JOANNEUM RESEARCH - HEALTH: Development and validation of a hybrid decision model to identify frailty in older adults with care needs in geriatric care facilities

Category: Master (at JOANNEUM RESEARCH)

Abstract: In order to further improve quality assurance in acute geriatrics and remobilization facilities (AG/R), a system for "benchmarking and reporting in acute geriatrics facilities" (BARS, www.healthgate.at) that has been in use for more than 10 years is being newly specified and developed. This redevelopment is being linked with innovative eHealth systems for process and decision support. One part of it concerns a process-integrated patient guidance system to identify frail patients with care needs in geriatric care facilities. The aim of this master's thesis is to develop and validate a hybrid decision model for identifying frail patients. The model draws on the following data sources: (1) classical phenotypic classification according to Fried et al. (2001), (2) classification based on ICD-10 diagnoses according to Gilbert et al. (2018), and (3) classification using a data-driven approach. A quality-assured data set with over 100,000 patient stays from the existing benchmarking system is available for model development. Predictors for a successful outcome of a stay in a geriatric care facility are to be identified for model building. Since the decision model is intended for future routine use, a process-integrated and time-saving approach must also be emphasized.
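
Purely as an illustration of what such a hybrid model could look like, the following Python sketch combines the three classification routes by a weighted vote; all feature names, ICD-10 codes, weights, and thresholds are hypothetical and would have to be derived from the BARS data and the cited literature.

```python
# All feature names, ICD-10 codes, weights, and thresholds below are
# hypothetical placeholders for illustration.
def fried_phenotype(patient):
    """(1) Phenotypic route (Fried et al., 2001): >= 3 of 5 criteria."""
    criteria = ["weight_loss", "exhaustion", "weakness",
                "slow_gait", "low_activity"]
    return 1 if sum(patient.get(c, 0) for c in criteria) >= 3 else 0

def icd10_route(patient):
    """(2) ICD-10 route (Gilbert et al., 2018): any frailty-related diagnosis."""
    frailty_codes = {"R54", "R26.2", "R63.4"}          # illustrative code set
    return 1 if frailty_codes & set(patient.get("icd10", [])) else 0

def data_driven_route(patient, predict_frail):
    """(3) Data-driven route: classifier trained on the BARS benchmarking data."""
    return 1 if predict_frail(patient["features"]) >= 0.5 else 0

def hybrid_decision(patient, predict_frail, weights=(0.4, 0.3, 0.3), thresh=0.5):
    votes = (fried_phenotype(patient), icd10_route(patient),
             data_driven_route(patient, predict_frail))
    score = sum(w * v for w, v in zip(weights, votes))
    return "frail" if score >= thresh else "not frail"

patient = {"weight_loss": 1, "exhaustion": 1, "weakness": 1,
           "icd10": ["R54"], "features": [0.7, 0.2, 0.9]}
print(hybrid_decision(patient, predict_frail=lambda f: 0.8))  # -> frail
```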

Details: English, German


#1: Recycling Intermediates in Feature Selection and Hyper-parameter Optimization Workloads

Category: Bachelor/Master

Status: In progress (SS19, Benjamin Rath)

Abstract: Building a robust machine learning (ML) pipeline for a given application use case often involves experimenting with different algorithms and feature selection strategies, as well as dedicated hyper-parameter optimization. Algorithms for feature selection and hyper-parameter optimization are themselves iterative, repeatedly invoking the same ML algorithms with overlapping subsets of features and different parameters. In this context, commonly used libraries of ML algorithms are treated in a black-box manner, which leads to redundant computation and thus unnecessarily slows down the exploration process. Inspired by work on recycling intermediates in query processing, this topic aims to explore opportunities for recycling intermediates in linear algebra programs, specifically in feature selection and hyper-parameter optimization workloads. The key ideas are to track the lineage of intermediates (operations and non-determinism), store these intermediates in a (bounded-size) in-memory cache, and reuse intermediates for redundant operations. Major challenges include the efficient tracking of local and distributed operations, effective reuse of partial intermediates via compensation plans, and effective eviction policies. All implementation work and experiments can be done in SystemDS, but alternative system proposals are possible.
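
A minimal sketch of the core idea, written in Python/NumPy rather than SystemDS: intermediates are keyed by a (here greatly simplified) lineage of the operation and its inputs, stored in a bounded in-memory cache, and reused on repeated invocations. Compensation plans, distributed operations, and eviction policies are exactly the open parts of the topic.

```python
# Sketch in Python/NumPy; the lineage key (operation name plus a hash of the
# inputs) is a stand-in for a proper lineage trace of operations and seeds.
import numpy as np

class LineageCache:
    def __init__(self, max_entries=64):
        self.cache, self.max_entries = {}, max_entries

    def execute(self, op_name, fn, *inputs):
        key = (op_name, tuple(hash(x.tobytes()) for x in inputs))
        if key in self.cache:                    # reuse a redundant intermediate
            return self.cache[key]
        result = fn(*inputs)
        if len(self.cache) < self.max_entries:   # bounded-size in-memory cache
            self.cache[key] = result
        return result

cache = LineageCache()
X, y = np.random.rand(1000, 50), np.random.rand(1000)
# two feature-selection iterations sharing the same Gram matrix X^T X
for _ in range(2):
    G = cache.execute("tsmm", lambda A: A.T @ A, X)        # computed once, reused once
    b = cache.execute("mvmult", lambda A, v: A.T @ v, X, y)
```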


#2: An Experimental Analysis of Auto Differentiation Techniques

Category: Bachelor

Abstract: The training of deep neural networks still predominantly relies on general-purpose optimizers like mini-batch Stochastic Gradient Descent (SGD), or extensions like SGD-Nesterov, SGD-momentum, RMSprop, Adam, and Adagrad, which share the same computational characteristics. Given a batch of training examples, we make a forward pass through the network, followed by a backward pass to derive gradients for the next optimizer update. To simplify the exploration of new architectures, deep learning frameworks like TensorFlow or PyTorch all provide means of auto differentiation (AD) for automatically deriving the backward pass from the forward pass. This topic aims to implement state-of-the-art AD techniques in a common framework (preferably SystemDS, which does not yet support AD), and to compare these techniques in terms of performance, memory requirements, and computational flexibility. In case this comparison reveals major shortcomings, there is the possibility of a follow-up master's thesis on extending these techniques.
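
For intuition, the following is a minimal reverse-mode AD sketch on scalars; production systems apply the same principle to tensors and entire operator DAGs, which is where the performance and memory trade-offs to be studied arise.

```python
# Scalar reverse-mode AD; real systems track the same local derivatives on
# tensor operations and replay them in reverse over the operator DAG.
class Var:
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        self.grad += seed                        # accumulate d(loss)/d(self)
        for parent, local_grad in self.parents:  # chain rule along each edge
            parent.backward(seed * local_grad)

x, w = Var(3.0), Var(2.0)
loss = x * w + w       # forward pass builds the computation graph
loss.backward()        # backward pass derives all gradients automatically
print(x.grad, w.grad)  # -> 2.0 4.0 (dloss/dx = w, dloss/dw = x + 1)
```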


#3: Resource-Constrained Operator Scheduling for Model Scoring

Category: Bachelor/Master

Abstract: An often overlooked aspect of ML systems is the low-level scheduling of physical operators, which affects the memory requirements for live intermediates in complex linear algebra programs. All existing systems follow data dependencies, but order instructions either in a breadth-first or depth-first manner, sometimes combined with runtime approaches such as ready queues and adaptive recomputation. Especially in ML scoring scenarios, memory requirements can become the dominating factor due to concurrent scoring streams or limited device memory. In contrast to traditional programming language compilers, ML systems are able to infer the sizes of intermediates (dimensions and sparsity), which allows for more intelligent scheduling decisions. Accordingly, this topic aims to devise novel scheduling schemes for minimal memory requirements, and for minimal runtime subject to a given memory budget. Potential lossless program transformations include DAG-level operator ordering, code motion across control flow, redundant computation, and update-in-place decisions. Besides classic combinatorial search and pruning techniques, we are also interested in exploring deep reinforcement learning for adaptively learning the search policy.
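
A minimal sketch of one possible (greedy) heuristic: among all ready operators, execute the one that leaves the smallest live-intermediate footprint. The DAG representation and sizes are illustrative assumptions; the topic itself targets principled search and learned policies beyond such greedy rules.

```python
# dag maps each operator to (inputs, output); size gives bytes per
# intermediate as inferred from dimensions and sparsity (both illustrative).
def schedule(dag, size, external):
    consumers = {}
    for op, (ins, _) in dag.items():
        for i in ins:
            consumers.setdefault(i, set()).add(op)

    live, done, order = set(external), set(), []
    while len(done) < len(dag):
        ready = [op for op in dag
                 if op not in done and all(i in live for i in dag[op][0])]

        def mem_after(op):                       # live memory after executing op
            ins, out = dag[op]
            freed = sum(size[i] for i in ins if consumers[i] <= done | {op})
            return sum(size[i] for i in live) + size[out] - freed

        best = min(ready, key=mem_after)         # greedy, memory-minimal choice
        order.append(best)
        done.add(best)
        live.add(dag[best][1])
        live -= {i for i in dag[best][0] if consumers[i] <= done}
    return order

# Executing the cheap chain f, g before h keeps the large intermediate t3
# in memory for as short a time as possible.
dag = {"f": (["X"], "t1"), "g": (["t1"], "t2"),
       "h": (["Y"], "t3"), "i": (["t2", "t3"], "out")}
size = {"X": 100, "Y": 100, "t1": 8, "t2": 8, "t3": 800, "out": 8}
print(schedule(dag, size, external={"X", "Y"}))  # -> ['f', 'g', 'h', 'i']
```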


#4: An Experimental Evaluation of Label Generation Techniques

Category: Bachelor

Abstract: Learning complex machine learning (ML) models often requires large amounts of labeled data, which can be costly to acquire in the absence of application feedback loops. Recently, we see a trend toward data programming, or weak supervision in general, to automatically generate noisy labels for large, unlabeled datasets as a means of pretraining, for example, deep neural networks. Common techniques include traditional boosting, label generation from conflicting labeling functions, the automatic generation of such labeling functions, and domain-specific object composition (e.g., placing a known object on a random background). This topic aims to conduct a systematic comparison of these techniques on different datasets and ML algorithms in order to gain a better understanding of the settings under which these techniques yield overall improvements in model accuracy.
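
For illustration, the following sketch generates noisy labels from three conflicting, hand-written labeling functions via a simple majority vote; data programming systems such as Snorkel instead learn per-function accuracies with a generative model, which is one of the techniques to be compared.

```python
# Three toy labeling functions for sentiment (vote +1/-1, or 0 to abstain);
# their conflicting votes are resolved by an unweighted majority vote.
import numpy as np

def lf_keyword(text):  return 1 if "excellent" in text else 0
def lf_negation(text): return -1 if "not" in text else 0
def lf_length(text):   return 1 if len(text) > 40 else 0

def weak_labels(texts, labeling_functions):
    votes = np.array([[lf(t) for lf in labeling_functions] for t in texts])
    return np.sign(votes.sum(axis=1))            # 0 = unresolved conflict

texts = ["this product is excellent and arrived quickly",
         "not what I expected",
         "excellent but not durable"]
print(weak_labels(texts, [lf_keyword, lf_negation, lf_length]))  # [ 1 -1  0]
```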


#5: An Experimental Analysis of Sparse Tensor Formats on CPUs and GPUs

Category: Bachelor

Abstract: Sparse matrices are ubiquitous in ML workloads for natural language processing, graph analytics, recommender systems, and scientific computation, but also as a result of data preprocessing and filtering. Therefore, various ML systems provide sparse matrix representations and automatically dispatch sparse operations. In contrast, modern deep learning frameworks like TensorFlow or PyTorch still have very limited support for sparse tensors. This topic aims at a systematic evaluation of sparse tensor formats and operations in existing ML systems and vendor-provided BLAS/DNN libraries. In detail, this entails an up-to-date survey of sparsity support in different ML systems, a benchmark for common data characteristics and operations (in the context of traditional ML and deep neural networks), as well as the implementation of selected formats and operations for both CPU and GPU devices. Relevant formats include COO, extensions of CSR/CSC for tensors, as well as the more recent CSF (compressed sparse fiber) format. Especially for GPU devices, additional extensions are likely necessary to cope with the irregularities of sparse tensors. Ultimately, the results of this thesis should guide the design and implementation of sparse tensor support and give robust decision rules for the selection of formats based on data characteristics.
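
As a baseline for the formats involved, the following sketch converts a dense matrix to COO and CSR and performs a sparse matrix-vector multiply over CSR; the survey and benchmark in this topic would cover far more formats (e.g., CSF) and operations, as well as GPU kernels.

```python
# COO and CSR construction from a dense matrix, plus a CSR matrix-vector
# multiply; kept deliberately simple (no blocking, no tensors, CPU only).
import numpy as np

def to_coo(dense):
    rows, cols = np.nonzero(dense)               # coordinate format
    return rows, cols, dense[rows, cols]

def to_csr(dense):
    rows, cols, vals = to_coo(dense)             # nonzeros are in row-major order
    counts = np.zeros(dense.shape[0] + 1, dtype=int)
    np.add.at(counts, rows + 1, 1)
    return np.cumsum(counts), cols, vals         # row_ptr, column indices, values

def csr_mv(row_ptr, cols, vals, x):
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):                      # gather per row
        lo, hi = row_ptr[i], row_ptr[i + 1]
        y[i] = vals[lo:hi] @ x[cols[lo:hi]]
    return y

A = np.array([[0., 2., 0.], [0., 0., 3.], [4., 0., 0.]])
x = np.array([1., 1., 1.])
row_ptr, cols, vals = to_csr(A)
print(csr_mv(row_ptr, cols, vals, x))            # -> [2. 3. 4.] == A @ x
```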


#6: Efficient Matrix-Matrix Multiplication on Heterogeneous, Losslessly Compressed Matrices

Category: Bachelor/Master

Abstract: Matrix compression is a widely used technique to speed up machine learning (ML) workloads. This includes lossy compression such as quantization (which reduces the value domain) and sparsification (which reduces the number of non-zero values) as commonly used for deep neural networks, as well as lossless compression via compressed linear algebra (CLA). CLA compresses matrices column-wise with lightweight database compression techniques (e.g., dictionary coding, RLE, and OLE, as well as column co-coding), and performs memory-bandwidth-bound linear algebra operations such as matrix-vector multiplications directly over the compressed representation. The reduced size allows fitting larger matrices into local or distributed memory, and thus improves performance for datasets that exceed available memory. This topic aims to extend this CLA framework by efficient matrix-matrix multiplications. So far, these operations are realized by repeated matrix-vector multiplications, which lack cache blocking. However, column co-coding also reduces the number of required floating point operations, which makes CLA interesting for compute-intensive matrix-matrix operations. The major challenge is an efficient blocking scheme for these heterogeneously encoded column groups, as well as for sparse encodings such as RLE and OLE.
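
A minimal sketch of a dictionary-encoded column group and a matrix-vector multiply directly over the compressed data: each distinct (co-coded) tuple is multiplied with the vector once and the result is scattered via per-row codes. The encoding is simplified; RLE/OLE groups and the blocking scheme for full matrix-matrix multiplication are the open parts of this topic.

```python
# Dictionary encoding of a column group; RLE/OLE and the blocking scheme
# for matrix-matrix multiplication are intentionally left open.
import numpy as np

class DictColumnGroup:
    def __init__(self, X, col_indices):
        cols = X[:, col_indices]
        # dictionary of distinct tuples plus one code per row (co-coding)
        self.dictionary, codes = np.unique(cols, axis=0, return_inverse=True)
        self.codes = codes.ravel()
        self.col_indices = np.array(col_indices)

    def right_mv(self, v):
        """Contribution of this column group to X @ v, computed compressed."""
        per_tuple = self.dictionary @ v[self.col_indices]  # one dot product per tuple
        return per_tuple[self.codes]                       # scatter via row codes

X = np.array([[1., 7.], [1., 7.], [2., 5.], [1., 7.]])
group = DictColumnGroup(X, [0, 1])    # co-code both columns
v = np.array([1., 2.])
print(group.right_mv(v))              # -> [15. 15. 12. 15.] == X @ v
```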


#7: LLVM-Based Code Generation Backend for Linear Algebra Programs on CPUs and GPUs

Category: Bachelor/Master

Abstract: There are many opportunities for fused operators (composed of chains of basic operators) in ML workloads. This includes avoiding unnecessary intermediates (whose allocation and write is often the bottleneck), scan sharing, and the exploitation of sparsity across operations. Accordingly, in recent years, numerous operator fusion and code generation frameworks were introduced. SystemDS inherits its Java-based code generation framework from Apache SystemML, which supports the exact, cost-based optimization of operator fusion plans for DAGs of linear algebra operations, as well as local and distributed operations over dense, sparse, and compressed data. This topic aims to extend this framework by an additional LLVM-based native compiler backend for both CPUs and GPUs. In detail, this entails additional LLVM compiler infrastructure, code templates for LLVM, native vector libraries, as well as dedicated tuning of data transfer and individual kernels. Additional (but optional) challenges include efficient GPU code generation for sparse and compressed matrices.
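
To illustrate what operator fusion buys, the following Python sketch contrasts an unfused execution of sum(X * Y + Z), which materializes two intermediates, with a fused single-pass kernel; the thesis itself is about generating such fused kernels as native LLVM code for CPUs and GPUs rather than as Java source.

```python
# Unfused vs. fused evaluation of sum(X * Y + Z); a code generator would
# emit the fused kernel as LLVM IR / native code instead of this Python loop.
import numpy as np

def unfused(X, Y, Z):
    t1 = X * Y               # materialized intermediate #1
    t2 = t1 + Z              # materialized intermediate #2
    return t2.sum()

def fused(X, Y, Z):
    total = 0.0              # single pass over the data, no intermediates
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            total += X[i, j] * Y[i, j] + Z[i, j]
    return total

X, Y, Z = (np.random.rand(100, 100) for _ in range(3))
assert np.isclose(unfused(X, Y, Z), fused(X, Y, Z))
```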


#8: Heterogeneous Quantization for Lossy Data Compression in ML Model Training

Category: Bachelor/Master


#9: Efficient, Distributed Outlier Detection for Point and Sequence Anomalies

Category: Bachelor/Master