My research focuses on the evaluation of machine learning (ML) systems: helping to figure out which ML systems are “better” and which are “worse”. As ML systems become more capable, answering this type of question generally gets harder: judging state-of-the-art chatbots is more challenging than comparing simple image classifiers. Yet reliable evaluation is critical, especially for highly capable systems, to guide ML research towards truly helpful and harmless systems.

My work within ML evaluation primarily focuses on benchmarks: software to compare and rank ML models automatically (a minimal illustration follows below). Currently, I am working on some of the open problems in benchmarking the most capable ML models, such as large language models (LLMs). Previously, I worked on benchmarking more specialised (smaller) ML models: I built benchmarking tools for (meta) reinforcement learning (RL) methods in the context of efficient building control systems (see the Bauwerk and Beobench libraries).
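For readers unfamiliar with the term, the sketch below shows the core idea of a benchmark under very simplified assumptions: models are treated as functions from prompts to answers, each task has a single correct answer, and scoring is plain accuracy. All names here (`run_benchmark`, `model_a`, etc.) are hypothetical and not taken from any of the libraries mentioned on this page.

```python
from typing import Callable, Dict, List, Tuple


def run_benchmark(
    models: Dict[str, Callable[[str], str]],
    tasks: List[Tuple[str, str]],
) -> List[Tuple[str, float]]:
    """Score each model on every (prompt, expected answer) task and rank by accuracy."""
    scores = []
    for name, model in models.items():
        correct = sum(model(prompt) == expected for prompt, expected in tasks)
        scores.append((name, correct / len(tasks)))
    # Rank models from best to worst by average score.
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy stand-ins for real models; an actual benchmark would wrap real model APIs.
    models = {
        "model_a": lambda prompt: "4",
        "model_b": lambda prompt: "5",
    }
    tasks = [("What is 2 + 2?", "4")]
    print(run_benchmark(models, tasks))
```

Real benchmarks for capable models differ mainly in the hard parts this sketch leaves out: defining tasks, judging open-ended answers, and aggregating scores fairly.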

I often apply benchmark methods specifically to evaluate the environmental impacts of ML systems. In the context of LLMs, I use my methods to evaluate the implicit environmental preferences present in the data used for fine-tuning and evaluating models (e.g. the extent to which environmental concerns are considered in the data). Previously, I primarily considered the effects of a direct environmental use-case of ML: how effective ML-based building control can be at lowering emissions.

See the projects page for a list of research software projects and papers.