Year of Award
2026
Document Type
Dissertation
Degree Type
Doctor of Philosophy (PhD)
Degree Name
Individualized Interdisciplinary Doctoral Program
Department or School/College
Interdisciplinary Studies Program
Committee Chair
Travis Wheeler
Committee Members
John Chandler, Doug Brinkerhoff, Jon Graham, Lucy Owen
Keywords
Artificial Intelligence, Large Language Models, Model Alignment, Model Interpretability, Natural Language Processing, Spatial Statistics
Abstract
Large language models (LLMs) have demonstrated remarkable linguistic capabilities, but the internal mechanisms driving their performance and behavior remain poorly understood. This dissertation investigates the interpretability of transformer-based language models from two complementary perspectives: the geometric organization of model internals in encoder-only models, and the mechanisms underlying safety behavior in decoder-only generative models.
In Part I, we address the challenge of measuring geometric properties in high-dimensional latent spaces, proposing and evaluating alternative measures of data spread that improve upon commonly used metrics. We then apply these measures alongside quantization-based metrics to examine the relationship between latent space geometry and downstream benchmarking performance, finding that a quantized cell density measure has a strong linear relationship with GLUE performance in a series of synthetically perturbed BERT-family models. We further explore how pre-training data scale, training task, and hyperparameter configuration shape the resulting model weight distributions, observing that training scale and hyperparameter choices have a more pronounced effect on weight distributions than training task.
In Part II, we investigate refusal behavior in Mixture of Experts (MoE) generative models, extending an existing activation steering method to MoE architectures and introducing expert-aware steering methods that isolate the contributions of individual model components. Our results demonstrate that refusal behavior is not localized to the MoE feed-forward sublayer, but is distributed across the feed-forward and attention sublayers, with evidence suggesting two distinct refusal pathways: an internal pathway mediated by the feed-forward sublayer and a contextual pathway mediated by attention. We also observe evidence of post-training behavioral entanglement and non-linear geometry in the model’s latent representations.
Together, these findings contribute to a growing understanding of the internal organization of transformer language models and highlight the complexity of the relationship between model internals, downstream performance, and learned behavior.
Recommended Citation
Marbut, Anna, "SEARCHING FOR NOOKS AND CRANNIES: GEOMETRIC AND MECHANISTIC PERSPECTIVES ON TRANSFORMER LANGUAGE MODEL INTERPRETABILITY" (2026). Graduate Student Theses, Dissertations, & Professional Papers. 12640.
https://scholarworks.umt.edu/etd/12640
© Copyright 2026 Anna Marbut