DS-2026-07: van der Wal, Oskar (2026) Taking a Step Back: Measuring and Mitigating Bias in Language Models. Doctoral thesis, University of Amsterdam.
Text
DS-2026-07.text.pdf - Published Version, Download (5MB)

Text (Samenvatting / Summary, in Dutch)
DS-2026-07.samenvatting.txt - Download (4kB)
Abstract
Language models increasingly shape how people access information, make decisions, and understand social issues.
Although they are often presented as neutral tools, these systems can reflect and reinforce social biases in subtle ways, influencing stereotypes, expectations, and judgments about different groups.
When a language model consistently depicts scientists as male or relies on historically biased racial information when processing clinical texts, it does more than reflect patterns in its training data: it can reinforce and normalize these associations across millions of interactions.
A common response to these concerns has been the proliferation of bias benchmarks and mitigation techniques.
However, improved benchmark scores are often interpreted as evidence of progress without sufficient clarity about what these measures actually capture, and mitigation efforts are limited when the underlying mechanisms of bias remain poorly understood.
The central question of this thesis is how to rigorously measure, understand, and mitigate representational bias in language models: systematic patterns in how models encode and use information about social groups and attributes.
Effective mitigation depends on understanding where and how biased associations arise, which in turn requires measurement approaches that are valid, reliable, and tested in realistic contexts.
To address these interconnected challenges, this thesis takes a step back and approaches this question from four complementary directions.
First, it develops a measurement framework adapted from psychology, treating bias as an underlying construct that cannot be observed directly and must instead be inferred from multiple measures.
This helps explain why existing metrics often disagree and establishes criteria for more credible measurement practices.
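To make the disagreement point concrete, the sketch below compares hypothetical scores from three bias metrics across six models: if the metrics measured the same underlying construct, models should rank similarly under each of them, and weak rank agreement is a validity warning sign. All metric names and score values here are placeholders, not results from the thesis.

```python
# A minimal sketch, assuming hypothetical bias-metric scores: if different
# metrics measured the same latent construct, models should rank similarly
# under each of them. Weak rank correlation signals a validity problem.
import numpy as np
from scipy.stats import spearmanr

# Placeholder scores of six models under three hypothetical bias metrics.
scores = {
    "metric_a": np.array([0.12, 0.34, 0.28, 0.45, 0.19, 0.40]),
    "metric_b": np.array([0.55, 0.21, 0.47, 0.30, 0.52, 0.25]),
    "metric_c": np.array([0.15, 0.36, 0.25, 0.48, 0.22, 0.38]),
}

names = list(scores)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        rho, p = spearmanr(scores[a], scores[b])
        print(f"{a} vs {b}: Spearman rho = {rho:.2f} (p = {p:.3f})")
```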
Second, a pilot study of clinical decision support documents five recurring failure patterns when models process realistic patient notes, showing that scenario-grounded testing reveals problems that simplified multiple-choice benchmarks miss and underscoring the methodological complexity of assessing bias in such settings.
Third, to understand the origins of bias, the thesis uses interpretability methods that make model behaviour more transparent, tracing how biased patterns emerge during training and showing that, across different architectures, gender information becomes increasingly concentrated in specific model components.
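A common technique for making such internal patterns visible is linear probing; the sketch below illustrates the general idea (not the thesis's exact method). A small classifier is trained per layer to predict a gender label from hidden representations, so layers with higher probe accuracy hold more linearly decodable gender information. The activations and labels here are random placeholders standing in for a real model's hidden states.

```python
# A minimal probing sketch under stated assumptions: hidden_states[layer]
# would normally come from a model's forward pass over probe sentences;
# here random placeholders stand in so the example is self-contained.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_layers, n_examples, d_model = 12, 200, 64

# Placeholder activations and binary gender labels (illustrative only).
hidden_states = rng.normal(size=(n_layers, n_examples, d_model))
gender_labels = rng.integers(0, 2, size=n_examples)

# One linear probe per layer; higher held-out accuracy means more linearly
# decodable gender information at that layer. Repeating this over training
# checkpoints traces where that information concentrates over time.
for layer in range(n_layers):
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden_states[layer], gender_labels, test_size=0.3, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"layer {layer:2d}: probe accuracy = {probe.score(X_te, y_te):.2f}")
```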
Fourth, by studying where bias is embedded in the model, it identifies the specific internal components that contribute to biased behaviour and demonstrates that modifying just these components can reduce bias while preserving overall performance better than interventions applied to the entire model.
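To illustrate what a targeted modification can look like (a hypothetical stand-in, not the specific intervention studied in the thesis), the sketch below ablates selected attention heads by zeroing only the output-projection columns that carry their contribution, leaving every other parameter untouched. Because such an edit changes only a small slice of the parameters, the rest of the model's behaviour is preserved, which is the intuition behind why targeted edits can retain overall performance better than whole-model interventions.

```python
# A minimal ablation sketch, assuming the heads to edit have already been
# identified (indices here are hypothetical). In nn.MultiheadAttention the
# concatenated head outputs feed the output projection, so zeroing head h's
# input columns of out_proj removes exactly that head's contribution.
import torch
import torch.nn as nn

d_model, n_heads = 64, 8
d_head = d_model // n_heads
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

heads_to_ablate = [2, 5]  # hypothetical heads flagged as bias-contributing

with torch.no_grad():
    for h in heads_to_ablate:
        attn.out_proj.weight[:, h * d_head : (h + 1) * d_head] = 0.0

x = torch.randn(1, 10, d_model)  # (batch, sequence, d_model)
out, _ = attn(x, x, x)
print(out.shape)  # torch.Size([1, 10, 64]): module still runs, minus two heads
```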
These investigations focus primarily on English language models, with selected analyses in Dutch.
Together, these studies offer new perspectives on bias measurement, provide insight into how biased associations develop during training, and illustrate how targeted, mechanism-informed interventions can reduce certain forms of bias under specific conditions.
More broadly, the thesis argues that understanding and reducing bias in language models requires carefully validated measurement in realistic contexts and attention to how models encode and use information internally, rather than reliance on isolated benchmark tests.
However, while mechanistic analysis and targeted interventions provide valuable tools, they are insufficient in isolation; addressing real-world harms ultimately requires integrating them with scenario-grounded evaluation and the study of deployed systems.
| Item Type: | Thesis (Doctoral) |
|---|---|
| Report Nr: | DS-2026-07 |
| Series Name: | ILLC Dissertation (DS) Series |
| Year: | 2026 |
| Subjects: | Language; Logic |
| Depositing User: | Dr Marco Vervoort |
| Date Deposited: | 17 Mar 2026 16:12 |
| Last Modified: | 09 Apr 2026 10:50 |
| URI: | https://eprints.illc.uva.nl/id/eprint/2414 |