# Theses

The working group typically offers various thesis topics each semester in the areas of computational statistics, machine learning, data mining, optimization and statistical software. You’re welcome to suggest your own topic as well.

Before you apply for a thesis topic, make sure that you fit the following profile:

- Knowledge in machine learning.
- Good R skills.

Before you start writing your thesis you **must** look for a supervisor within the working group.

Send an email to **compstat [at] stat.uni-muenchen.de** with the following information:

- Planned starting date of your thesis.
- Thesis topic (from the list of thesis topics or your own suggestion).
- Previously attended classes on machine learning and programming with R.

*Your application will only be processed if it contains* all *required information.*

## Potential Thesis Topics

[Potential Thesis Topics] [Student Research Projects] [Current Theses] [Completed Theses]

Below is a list of potential thesis topics. Before you start writing your thesis you **must** look for a supervisor within the working group.

#### * Learning Embeddings for Categorical Variables (Supervisor: Florian Pfisterer)

Many machine learning algorithms naturally lend themselves to numeric data. In order for them to deal with categorical data, either extensions of the algorithms or numerical representations (one-hot encoding etc.) are required. One class of such numerical representations are so-called ‘embeddings’, which can be obtained, for example, from neural networks. Embeddings can be learned from datasets using different methods. Methods that allow for learning embeddings will be implemented and tested in this thesis.

Possible directions:

- Exploratory and interpretable embeddings for a single dataset (e.g. video game data).
- Embeddings as a general method for encoding categorical variables.
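
As a rough illustration of the two representations mentioned above, the following Python sketch contrasts one-hot encoding with an embedding lookup. The category names and the embedding values are made up for illustration; in practice the embedding matrix would be learned from data, for example as the first layer of a neural network.

```python
# Hypothetical categorical feature with three levels (illustrative only).
levels = ["action", "puzzle", "sports"]
index = {level: i for i, level in enumerate(levels)}


def one_hot(level):
    """Return a vector with a single 1 at the position of the level."""
    vec = [0.0] * len(levels)
    vec[index[level]] = 1.0
    return vec


# Made-up 2-dimensional embedding matrix, one row per level.
# In a real model these numbers would be learned, not hand-picked.
embedding_matrix = [
    [0.9, 0.1],   # action
    [-0.3, 0.8],  # puzzle
    [0.7, 0.2],   # sports
]


def embed(level):
    """Look up the dense vector for a level."""
    return embedding_matrix[index[level]]


def one_hot_times_matrix(level):
    """Multiply the one-hot vector with the embedding matrix."""
    oh = one_hot(level)
    dim = len(embedding_matrix[0])
    return [
        sum(oh[i] * embedding_matrix[i][d] for i in range(len(levels)))
        for d in range(dim)
    ]
```

The lookup gives the same result as multiplying the one-hot vector by the embedding matrix, which is why an embedding can be viewed as a linear layer on top of one-hot input.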

#### * Compressing Ensembles of Machine Learning Models (Supervisor: Florian Pfisterer)

Complex ensembles of machine learning models usually achieve better predictive performance, but they are very hard to deploy in real-world settings such as mobile phones or machines. The question to be answered in this work is whether the results of an ensemble can be compressed into a single model that is (possibly) easy to deploy, with minimal prerequisites and minimal technical and time overhead. One possible class of such approximators is the family of (feed-forward) neural networks. Training them can be simplified here, because overfitting to the predictions of the ensemble is no longer a problem but something to strive for.

The work includes implementing functionality for training a learner on the output of an arbitrary ensemble or model. Afterwards, the model performance and the resulting stability and usability in the proposed compression context need to be evaluated. This includes comparing different neural network architectures with respect to stability, and evaluating possible extensions to the usual training process that would allow for faster or more stable training. An additional question is whether some parts of preprocessing can also be approximated in this way, which would further reduce the overhead required for real-world deployment of such models.
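
The basic idea of training a single model on the output of an ensemble can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the three ensemble members are hand-written functions, and the student is a one-variable linear model fitted by least squares rather than the neural network the topic proposes.

```python
# Hypothetical ensemble: three simple models whose predictions are averaged.
ensemble_members = [
    lambda x: 2.0 * x + 1.0,
    lambda x: 1.8 * x + 1.3,
    lambda x: 2.2 * x + 0.7,
]


def ensemble_predict(x):
    """Average the predictions of all ensemble members."""
    return sum(m(x) for m in ensemble_members) / len(ensemble_members)


def fit_student(xs, targets):
    """Fit a one-variable linear model by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(targets) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, targets))
    slope = sxt / sxx
    intercept = mean_t - slope * mean_x
    return slope, intercept


# Unlabeled inputs are enough: the ensemble provides the training targets.
inputs = [i / 10 for i in range(-20, 21)]
soft_targets = [ensemble_predict(x) for x in inputs]
slope, intercept = fit_student(inputs, soft_targets)


def student_predict(x):
    """Predict with the single compressed model."""
    return slope * x + intercept
```

Because the student is fitted to the ensemble's predictions rather than to noisy labels, matching those predictions as closely as possible is the goal, and only the two numbers `slope` and `intercept` would need to be deployed.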

#### * Multi-Output Prediction (Supervisor: Quay Au)

The general learning task of predicting multiple targets, which could be real-valued, binary, ordinal, categorical or even of mixed type, is known as multi-output prediction. The general idea is to improve the accuracy of a predictor by making use of the statistical dependencies among the output variables. Methods that transform the multi-output prediction problem into single-output prediction problems, so that ordinary classification and regression algorithms can be applied, shall be implemented in the machine learning R package mlr. The evaluation of multi-output prediction problems is inherently a challenging task and shall be worked out in this thesis.
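
A minimal Python sketch of the simplest such transformation, fitting one independent single-output model per target, might look as follows. It is not the mlr implementation; the function names, the toy data and the choice of a 1-nearest-neighbor learner are all assumptions made for illustration.

```python
def nearest_neighbor_fit(features, target):
    """Return a 1-nearest-neighbor predictor for a single target."""
    def predict(x):
        distances = [
            sum((a - b) ** 2 for a, b in zip(x, row)) for row in features
        ]
        best = distances.index(min(distances))
        return target[best]
    return predict


def fit_independent(features, targets, learner):
    """Fit one single-output model per target column."""
    n_targets = len(targets[0])
    models = []
    for j in range(n_targets):
        column = [row[j] for row in targets]
        models.append(learner(features, column))
    return models


def predict_multi(models, x):
    """Combine the single-output predictions into one output vector."""
    return [model(x) for model in models]


# Made-up toy data: two features, three targets of mixed type
# (binary, real-valued, categorical).
X = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
Y = [[0, 1.5, "low"], [0, 2.5, "low"], [1, 9.0, "high"]]

models = fit_independent(X, Y, nearest_neighbor_fit)
prediction = predict_multi(models, [3.5, 4.2])
```

This independent transformation ignores the dependencies among the outputs. Transformations that exploit them, for instance by passing the predictions for earlier targets as additional features to the models for later targets, follow the same fit-per-target pattern.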

#### * Learning to Optimize with Reinforcement Learning (Supervisor: Xudong Sun)

Reinforcement learning is a powerful tool for optimization problems from different fields, for example learning to optimize, reinforcement active learning, and deep compression. This project will use reinforcement learning to optimize a chosen machine learning problem. Other topics in reinforcement learning, including automatic code writing and soft Q-learning, are also possible.

#### * Video Activity Detection Using Convolutional Recurrent Neural Networks (Supervisor: Xudong Sun)

This project will apply state-of-the-art recurrent and convolutional neural network models and benchmark the results on public datasets, for instance UCF101. It is an application of functional-on-scalar classification, extended from the one-dimensional curve case to the multidimensional image case.

#### * Federated Machine Learning with Non-i.i.d. Sites (Supervisor: Xudong Sun)

This project will explore the setting in which federated data centers have different data distributions, and the effect this has on state-of-the-art distributed machine learning algorithms (using map-reduce, federated, or online learning).
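
A toy Python sketch can show why differing sites matter for aggregation. It assumes a deliberately trivial "model" (the mean of the local data) and made-up site names and values; real federated algorithms exchange model parameters or gradients instead.

```python
# Made-up local datasets; the two sites differ in size and distribution.
site_data = {
    "site_a": [1.0, 1.0, 1.0, 1.0],
    "site_b": [5.0, 5.0],
}


def local_update(values):
    """Each site shares only its local estimate and sample count."""
    return sum(values) / len(values), len(values)


def aggregate_unweighted(updates):
    """Average local estimates, treating every site equally."""
    return sum(est for est, _ in updates) / len(updates)


def aggregate_weighted(updates):
    """Average local estimates, weighted by local sample count."""
    total = sum(n for _, n in updates)
    return sum(est * n for est, n in updates) / total


updates = [local_update(values) for values in site_data.values()]

# Reference value that would be obtained if all data could be pooled.
pooled = sum(sum(v) for v in site_data.values()) / sum(
    len(v) for v in site_data.values()
)
```

With these values the unweighted average of the site estimates differs from the pooled estimate, while the sample-size-weighted average recovers it. For more complex models, weighting alone is generally not enough, which is part of what makes the non-i.i.d. setting interesting.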

## Disputation

#### Procedure

The disputation of a thesis lasts about 60-90 minutes and consists of two parts. Only the first part is relevant for the grade; it takes 30 minutes for a bachelor thesis and 40 minutes for a master thesis. Here, the student is expected to summarize the main results of the thesis in a presentation. The supervisor(s) will ask questions regarding the content of the thesis during the presentation. In the second part (after the presentation), the supervisors will give detailed feedback and discuss the thesis with the student. This will take about 30 minutes.

#### FAQ

- How do I prepare for the disputation?

You have to prepare a presentation, and if there is a bigger time gap between handing in your thesis and the disputation, you might want to reread your thesis.

- How many slides should I prepare?

That’s up to you, but you have to respect the time limit. Preparing more than 20 slides for a Bachelor’s presentation or more than 30 slides for a Master’s is VERY likely a very bad idea.

- Where do I present?

Bernd’s office, in front of the big TV. At least one PhD will be present, maybe more. If you want to present in front of a larger audience in the seminar room or the old library, please book the room yourself and inform us.

- English or German?

We do not care, you can choose.

- What do I have to bring with me?

A document (Prüfungsprotokoll, the examination record) which you get from the examination office (“Prüfungsamt”, Frau Maxa or Frau Höfner) for the disputation. Your laptop or a USB stick with the presentation. You can also email Bernd a PDF.

- How does the grading work?

The student will be graded regarding the quality of the thesis, the presentation and the oral discussion of the work. The grade is mainly determined by the written thesis itself, but the grade can improve or drop depending on the presentation and your answers to defense questions.

- What should the presentation cover?

The presentation should cover your thesis, including motivation, introduction, description of new methods and results of your research. Please do NOT explain already existing methods in detail here; put more focus on novel work and the results.

- What kind of questions will be asked after the presentation?

The questions will be directly connected to your thesis and related theory.

## Student Research Projects

We are always interested in mentoring interesting student research projects. Please contact us directly with an interesting research idea. In the future you will also be able to find research project topics below.

### Available projects

Requirements for available

#### Efficient Job Scheduling For High Performance Computing *

Large, possibly embarrassingly parallel jobs, for example evaluations of ML algorithms on different datasets or resampling splits, often incur large scheduling overheads. In order to reduce the load on cluster workload managers and to improve flexibility, a new scheduling manager is to be created and improved that allows for asynchronous scheduling on multiple clusters.

#### Improved Exploration and Experiment Selection *

Knowledge about machine learning algorithms and their behavior for different hyperparameters can be gathered from a large number of evaluations of random algorithms and configurations on random datasets. As hyperparameter spaces differ widely from algorithm to algorithm, better selection can improve the exploration aspect of this approach. Methods from the Design and Analysis of Computer Experiments (DACE) offer possibilities for more efficient exploration of those spaces. One approach could be to use models to schedule experiments that optimize for exploration of the relevant hyperparameter space.

#### Interactive Visualization of HPC Experiment Results *

Deeper insight into qualitative relationships in large benchmark results can be obtained from interactive visualizations and modeling approaches. Within this project, different visualization and modeling techniques are to be explored and evaluated. Interactive visualization methods that allow for zooming out and drilling down need to be evaluated.

- Separate funding for this project (Student Assistant) can be made available.

For more information, please visit the official web page Studentische Forschungsprojekte (Lehre@LMU).

## Current Theses (With Working Titles)

Student | Title | Type |
---|---|---|
B. Burger | Average Marginal Effects in Machine Learning | MA |
S. Gruber | Visualization and Efficient Replay Memory for Reinforcement Learning | BA |
A. Bukreeva | Neural Network Embeddings for Categorical Data | BA |
R. Groh | Benchmarking: Tests and Vizualisations | MA |

## Completed Theses

### Completed Theses (LMU Munich)

Student | Title | Type | Completed |
---|---|---|---|
J. Goschenhofer | | MA | 2018 |
J. Moosbauer | Bayesian Optimization under Noise for Model Selection in Machine Learning | MA | 2018 |
J. Fried | Interpretable Machine Learning - An Application Study using the Munich Rent Index | MA | 2018 |
S. Coors | Automatic Gradient Boosting | MA | 2018 |
D. Schalk | Efficient and Distributed Model-Based Boosting for Large Datasets | MA | 2018 |
K. Engelhardt | Linear individual model-agnostic explanations - discussion and empirical analysis of modifications | MA | 2018 |
N. Klein | Extending Hyperband with Model-Based Sampling Strategies | MA | 2018 |
M. Dumke | Reinforcement learning in R | MA | 2018 |
M. Lee | Anomaly Detection using Machine Learning Methods | MA | 2018 |
J. Langer | RNN Bandmatrix | MA | 2018 |
B. Klepper | Configuration of deep neural networks using model-based optimization | MA | 2017 |
F. Pfisterer | Kernelized anomaly detection | MA | 2017 |
M. Binder | Automatic model selection and hyperparameter optimization | MA | 2017 |
V. Mayer | mlrMBO / RF distance based infill criteria | MA | 2017 |
L. Haller | Kostensensitive Entscheidungsbäume für beobachtungsabhängige Kosten | BA | 2016 |
B. Zhang | Implementation of 3D Model Visualization for Machine Learning | BA | 2016 |
T. Riebe | Eine Simulationsstudie zum Sampled Boosting | BA | 2016 |
P. Rösch | Implementation and Comparison of Stacking Methods for Machine Learning | MA | 2016 |
M. Erdmann | Runtime estimation of ML models | BA | 2016 |
A. Exterkate | Process Mining: Checking Methods for Process Conformance | MA | 2016 |
J.-Q. Au (master's student at TU Dortmund) | Implementation of Multilabel Algorithms and their Application on Driving Data | MA | 2016 |
J. Thomas | Stability Selection for Component-Wise Gradient Boosting in Multiple Dimensions | MA | 2016 |
A. Franz | Detecting Future Equipment Failures: Predictive Maintenance in Chemical Industrial Plants | MA | 2016 |
T. Kühn | Fault Detection for Fire Alarm Systems based on Sensor Data | MA | 2016 |
B. Schober | Laufzeitanalyse von Klassifikationsverfahren in R | BA | 2015 |
F. Pfisterer | Benchmark Analysis for Machine Learning in R | BA | 2015 |
T. Kühn | Implementierung und Evaluation ergänzender Korrekturmethoden für statistische Lernverfahren bei unbalancierten Klassifikationsproblemen | BA | 2014 |

### Completed Theses (Supervised by Bernd Bischl at TU Dortmund)

Student | Title | Type | Completed |
---|---|---|---|
P. Probst | Anwendung von Multilabel-Klassifikationsverfahren auf Medizingerätestatusreporte zur Generierung von Reparaturvorschlägen | MA | 2015 |
D. Kirchhoff | Erweiterung der Plattform OpenML um Ereigniszeitanalysen | MA | 2015 |
J. Bossek | Modellgestützte Algorithmenkonfiguration bei Feature-basierten Instanzen: Ein Ansatz über das Profile-Expected-Improvement | Dipl. | 2015 |
J. Richter | Modellbasierte Hyperparameteroptimierung für maschinelle Lernverfahren auf großen Daten | MA | 2015 |
B. Elkemann | Implementierung einer Testsuite für mehrkriterielle Optimierungsprobleme | BA | 2014 |
M. Dagge | R-Pakete für Datenmanagement und -manipulation großer Datensätze | BA | 2014 |
K. U. Schorck | Lokale Kriging-Verfahren zur Modellierung und Optimierung gemischter Parameterräume mit Abhängigkeitsstrukturen | BA | 2014 |
P. Kerschke | Kostensensitive Algorithmenselektion für stetige Black-Box-Optimierungsprobleme basierend auf explorativer Landschaftsanalyse | MA | 2013 |
D. Horn | Exploratory Landscape Analysis für mehrkriterielle Optimierungsprobleme | MA | 2013 |
J. Bossek | Feature-based Algorithm Selection for the Traveling-Salesman-Problem | BA | 2013 |
O. Meyer | Implementierung und Untersuchung einer parallelen Support Vector Machine in R | Dipl. | 2013 |
S. Hess | Sequential Model-Based Optimization by Ensembles: A Reinforcement Learning Based Approach | Dipl. | 2012 |
P. Kerschke | Vorhersage der Verkehrsdichte in Warschau basierend auf dem Traffic Simulation Framework | BA | 2011 |
L. Schlieker | Klassifikation von Blutgefäßen und Neuronen des menschlichen Gehirns anhand von ultramikroskopierten 3D-Bilddaten | BA | 2011 |
H. Riedel | Uncertainty Sampling zur Auswahl optimaler Sampler aus der trunkierten Normalverteilung | BA | 2011 |
S. Meinke | Over-/Undersampling für unbalancierte Klassifikationsprobleme im Zwei-Klassen-Fall | BA | 2010 |