Graduate Student Theses, Dissertations, & Professional Papers

SEARCHING FOR NOOKS AND CRANNIES: GEOMETRIC AND MECHANISTIC PERSPECTIVES ON TRANSFORMER LANGUAGE MODEL INTERPRETABILITY

Author

Anna Marbut

Year of Award

2026

Document Type

Dissertation

Degree Type

Doctor of Philosophy (PhD)

Degree Name

Individualized Interdisciplinary Doctoral Program

Department or School/College

Interdisciplinary Studies Program

Committee Chair

Travis Wheeler

Committee Members

John Chandler, Doug Brinkerhoff, Jon Graham, Lucy Owen

Keywords

Artificial Intelligence, Large Language Models, Model Alignment, Model Interpretability, Natural Language Processing, Spatial Statistics

Abstract

Large language models (LLMs) have demonstrated remarkable linguistic capabilities, but the internal mechanisms driving their performance and behavior remain poorly understood. This dissertation investigates the interpretability of transformer-based language models from two complementary perspectives: the geometric organization of model internals in encoder-only models, and the mechanisms underlying safety behavior in decoder-only generative models.

In Part I, we address the challenge of measuring geometric properties in high-dimensional latent spaces, proposing and evaluating alternative measures of data spread that improve upon commonly used metrics. We then apply these measures alongside quantization-based metrics to examine the relationship between latent space geometry and downstream benchmarking performance, finding that a quantized cell density measure has a strong linear relationship with GLUE performance in a series of synthetically perturbed BERT-family models. We further explore how pre-training data scale, training task, and hyperparameter configuration shape the resulting model weight distributions, observing that training scale and hyperparameter choices have a more pronounced effect on weight distributions than training task.

In Part II, we investigate refusal behavior in Mixture of Experts (MoE) generative models, extending an existing activation steering method to MoE architectures and introducing expert-aware steering methods that isolate the contributions of individual model components. Our results demonstrate that refusal behavior is not localized to the MoE feed-forward sublayer, but is distributed across the feed-forward and attention sublayers, with evidence suggesting two distinct refusal pathways: an internal pathway mediated by the feed-forward sublayer and a contextual pathway mediated by attention. We also observe evidence of post-training behavioral entanglement and non-linear geometry in the model’s latent representations.

Together, these findings contribute to a growing understanding of the internal organization of transformer language models and highlight the complexity of the relationship between model internals, downstream performance, and learned behavior.

Recommended Citation

Marbut, Anna, "SEARCHING FOR NOOKS AND CRANNIES: GEOMETRIC AND MECHANISTIC PERSPECTIVES ON TRANSFORMER LANGUAGE MODEL INTERPRETABILITY" (2026). Graduate Student Theses, Dissertations, & Professional Papers. 12640.
https://scholarworks.umt.edu/etd/12640

ScholarWorks at University of Montana
