My research focuses on the evaluation of applied machine learning (ML) systems. Currently, I am working on language model (LM) evaluation, with a special focus on human-interpretable evaluation systems (see KraspAI). The goal is to make ML capabilities and risks easier to investigate and understand. Much of my work centres on creating standardised benchmark tools for specific problems, helping to accelerate progress on them. My prior work includes benchmark tools for evaluating (meta) reinforcement learning (RL) methods in the context of building control systems (see Bauwerk, Beobench).

Across my evaluation work, I pay special attention to environmental impacts. I consider both direct impacts (e.g. carbon emissions from energy usage) and indirect ones (e.g. from LM-based decision making).

See the projects page for a list of research software projects and papers.