On the Biology of a Large Language Model

Large language models display impressive capabilities. However, for the most part, the mechanisms by which they do so are unknown. The black-box nature of models is increasingly unsatisfactory as they advance in intelligence and are deployed in a growing number of applications. Our goal is to reverse engineer how these models work on the inside, so we may better understand them and assess their fitness for purpose.

The challenges we face in understanding language models resemble those faced by biologists. Living organisms are complex systems which have been sculpted by billions of years of evolution. While the basic principles of evolution are straightforward, the biological mechanisms it produces are spectacularly intricate. Likewise, while language models are generated by simple, human-designed training algorithms, the mechanisms born of these algorithms appear to be quite complex.

Progress in biology is often driven by new tools. The development of the microscope allowed scientists to see cells for the first time, revealing a new world of structures invisible to the naked eye. In recent years, many research groups have made exciting progress on tools for probing the insides of language models (e.g.

H. Cunningham, A. Ewart, L. Smith, R. Huben, L. Sharkey.arXiv preprint arXiv:2309.08600. 2023.
T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J.E. Burke, T. Hume, S. Carter, T. Henighan, C. Olah.Transformer Circuits Thread. 2023.
A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, B. Chen, A. Pearce, C. Citro, E. Ameisen, A. Jones, H. Cunningham, N.L. Turner, C. McDougall, M. MacDiarmid, C.D. Freeman, T.R. Sumers, E. Rees, J. Batson, A. Jermyn, S. Carter, C. Olah, T. Henighan.Transformer Circuits Thread. 2024.
L. Gao, T.D. la Tour, H. Tillman, G. Goh, R. Troll, A. Radford, I. Sutskever, J. Leike, J. Wu.arXiv preprint arXiv:2406.04093. 2024.
J. Dunefsky, P. Chlenski, N. Nanda.Advances in Neural Information Processing Systems, Vol 37, pp. 24375--24410. 2025.

). These methods have uncovered representations of interpretable concepts – “features” – embedded within models’ internal activity. Just as cells form the building blocks of biological systems, we hypothesize that features form the basic units of computation inside models.The analogy between features and cells shouldn’t be taken too literally. Cells are well-defined, whereas our notion of what exactly a “feature” is remains fuzzy, and is evolving with improvements to our tools.

However, identifying these building blocks is not sufficient to understand the model; we need to know how they interact. In our companion paper, Circuit Tracing: Revealing Computational Graphs in Language Models, we build on recent work (e.g.

J. Dunefsky, P. Chlenski, N. Nanda.Advances in Neural Information Processing Systems, Vol 37, pp. 24375--24410. 2025.
S. Marks, C. Rager, E.J. Michaud, Y. Belinkov, D. Bau, A. Mueller.arXiv preprint arXiv:2403.19647. 2024.
X. Ge, F. Zhu, W. Shu, J. Wang, Z. He, X. Qiu.arXiv preprint arXiv:2405.13868. 2024.
J. Lindsey, A. Templeton, J. Marcus, T. Conerly, J. Batson, C. Olah. 2024.

) to introduce a new set of tools for identifying features and mapping connections between them – analogous to neuroscientists producing a “wiring diagram” of the brain. We rely heavily on a tool we call attribution graphs, which allow us to partially trace the chain of intermediate steps that a model uses to transform a specific input prompt into an output response. Attribution graphs generate hypotheses about the mechanisms used by the model, which we test and refine through follow-up perturbation experiments.

In this paper, we focus on applying attribution graphs to study a particular language model – Claude 3.5 Haiku, released in October 2024, which serves as Anthropic’s lightweight production model as of this writing. We investigate a wide range of phenomena. Many of these have been explored before (see § 16 Related Work), but our methods are able to offer additional insight, in the context of a frontier model:

Introductory Example: Multi-step Reasoning. We present a simple example where the model performs “two-hop” reasoning “in its head” to identify that “the capital of the state containing Dallas” is “Austin.” We can see and manipulate an internal step where the model represents “Texas”.
Planning in Poems. We discover that the model plans its outputs ahead of time when writing lines of poetry. Before beginning to write each line, the model identifies potential rhyming words that could appear at the end. These preselected rhyming options then shape how the model constructs the entire line.
Multilingual Circuits. We find the model uses a mixture of language-specific and abstract, language-independent circuits. The language-independent circuits are more prominent in Claude 3.5 Haiku than in a smaller, less capable model.
Addition. We highlight cases where the same addition circuitry generalizes between very different contexts.