My research focuses on the evaluation of machine learning (ML) systems: helping to figure out which ML systems are “better” and which are “worse”. As ML systems become more capable, answering this type of question generally gets harder: judging state-of-the-art chatbots is more challenging than comparing simple image classifiers. Yet reliable evaluation is critical, especially for highly capable systems, to guide ML research towards truly helpful and harmless systems.

My work within ML evaluation primarily focuses on benchmarks: software to compare and rank ML models automatically (a minimal illustration follows below). Currently, I am working on some of the open problems in benchmarking the most capable ML models, such as large language models (LLMs). Previously, I worked on benchmarking more specialised (smaller) ML models: I built benchmarking tools for (meta) reinforcement learning (RL) methods in the context of efficient building control systems (see the Bauwerk and Beobench libraries).
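For readers unfamiliar with the term, the sketch below shows the core idea of a benchmark under very simplified assumptions: models are treated as functions from prompts to answers, each task has a single correct answer, and scoring is plain accuracy. All names here (`run_benchmark`, `model_a`, etc.) are hypothetical and not taken from any of the libraries mentioned on this page.

```python
from typing import Callable, Dict, List, Tuple


def run_benchmark(
    models: Dict[str, Callable[[str], str]],
    tasks: List[Tuple[str, str]],
) -> List[Tuple[str, float]]:
    """Score each model on every (prompt, expected answer) task and rank by accuracy."""
    scores = []
    for name, model in models.items():
        correct = sum(model(prompt) == expected for prompt, expected in tasks)
        scores.append((name, correct / len(tasks)))
    # Rank models from best to worst by average score.
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy stand-ins for real models; an actual benchmark would wrap real model APIs.
    models = {
        "model_a": lambda prompt: "4",
        "model_b": lambda prompt: "5",
    }
    tasks = [("What is 2 + 2?", "4")]
    print(run_benchmark(models, tasks))
```

Real benchmarks for capable models differ mainly in the hard parts this sketch leaves out: defining tasks, judging open-ended answers, and aggregating scores fairly.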

I often apply benchmark methods specifically to evaluate the environmental impacts of ML systems. In the context of LLMs, I use my methods to evaluate the implicit environmental preferences present in the data used for fine-tuning and evaluating models (e.g. the extent to which environmental concerns are considered in the data). Previously, I primarily considered the effects of a direct environmental use-case of ML: how effective ML-based building control can be at lowering emissions.

See the projects page for a list of research software projects and papers.