Mechanistic interpretability: how AI constructs and reveals its internal models

Last update: January 22, 2026
  • Mechanistic interpretability studies weights, activations, and internal circuits to explain how neural networks and LLMs perform their computations.
  • Models organize meanings into high-dimensional conceptual spaces, with many concepts represented as linear directions in that vector space.
  • Tools such as feature "microscopes" and sparse autoencoders allow you to extract, analyze, and even manipulate internal features of models.
  • Applications such as geospatial interpretability show how LLMs structure geographic information, bringing AI closer to debates about cognition and safety.

Mechanistic interpretability in AI

Mechanistic interpretability is becoming one of the most active and important lines of research in modern AI. It matters all the more as deep neural networks and large language models (LLMs) begin to influence decisions in virtually every field. Instead of just looking at a model's final performance, this approach asks: what exactly is happening inside, in the weights and activations, when an AI system makes a prediction, writes a text, or solves a complex problem?

The term "black box" has never been more relevant than now. Hundreds of millions of people use chatbots daily, but even the teams that develop these systems don't fully understand how they arrive at certain answers, why they "hallucinate" facts, or in what situations they might behave deceptively. Mechanistic interpretability emerges precisely to open this black box, map its internal mechanisms, and connect neurons, features, and circuits to concepts that we can understand.

What exactly is mechanistic interpretability?

Mechanistic interpretability is the systematic study of the internal structure of AI models, focusing on weights, activations, and intermediate computations, to understand how they perform their tasks. Instead of treating the neural network as an opaque block that transforms input into output, this area attempts to decompose the model into smaller components (neurons, attention heads, layers, linear features) and link each part to an observable behavior.

The central objective is not merely to explain an isolated decision after the fact, but to build a detailed map of the model's internal computation. This involves identifying which neurons or combinations of neurons represent certain patterns (such as proper names, code structures, emotional tones, malicious instructions), how these representations are combined across layers, and how all of this results in a specific output.

This perspective has been growing rapidly in the scientific community, with dedicated workshops (such as the first major workshop on mechanistic interpretability at major machine learning conferences), dozens of startups focused on the topic, and a growing number of analytical tools. The volume of papers submitted to specialized workshops easily exceeds one hundred per edition, showing that this is no longer a niche but a consolidated field in full expansion.

The great challenge is to reduce the gap between the impressive performance of the models and our understanding of them. As long as we continue to treat LLMs and neural networks as statistical mysteries, it will be much more difficult to predict edge behaviors, identify sophisticated vulnerabilities, detect manipulation, and deploy these systems reliably in critical scenarios.

Internal representations in language models

Conceptual Spaces and the Linear Representation Hypothesis

One of the most powerful insights for understanding mechanistic interpretability is the idea that neural networks construct high-dimensional “conceptual spaces.” Instead of thinking of meanings as definitions in a dictionary, we can see them as points in a huge vector space, implicit in the network, formed by the weights and activations across the layers.

This space is not physical; it is a side effect of how the network processes signals. Each input (a word, a pixel, a sound, a place name, a code snippet) is mapped to a vector in a multidimensional space. This vector captures everything the model "deemed relevant" about that input, based on its training, and can encode semantic nuances, style, context, intent, and much more.

The so-called Linear Representation Hypothesis states that many of these internal concepts can be viewed as linear directions in this space. In other words, there is one direction that corresponds to "praise," another to "coding error," another to "digital backdoor," and so on. More complex concepts can be formed by combining several of these basic directions.
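
A minimal sketch of what "a concept as a direction" can mean in practice. It assumes toy 3-dimensional activations invented for illustration, and it estimates the direction with a difference of class means, which is one common approach and not necessarily the one used in any specific study. All names (`concept_direction`, `concept_score`, the sample data) are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def concept_direction(with_concept, without_concept):
    """Estimate a concept direction as the normalized difference of class means."""
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    return normalize([p - n for p, n in zip(mu_pos, mu_neg)])

# Hypothetical toy activations: inputs that contain "praise" vs. neutral inputs.
praise = [[2.0, 0.1, 0.5], [1.8, -0.2, 0.4], [2.2, 0.0, 0.6]]
neutral = [[0.1, 0.0, 0.5], [-0.1, 0.2, 0.4], [0.0, -0.1, 0.6]]

direction = concept_direction(praise, neutral)

def concept_score(activation):
    """Projection of an activation onto the estimated concept direction."""
    return dot(activation, direction)

print(concept_score([2.1, 0.0, 0.5]))  # large projection: concept likely present
print(concept_score([0.0, 0.0, 0.5]))  # near-zero projection: concept likely absent
```

The projection acts as a simple linear readout: the larger it is, the more strongly the activation points along the estimated concept direction.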

This means that any type of information (language, vision, audio, movement) can be represented as vectors in this same conceptual space. When an LLM processes a sentence, for example, it is basically tracing a path in that space, updating the context vector with each token to capture the accumulated meaning up to that point.

This perspective also explains why it is possible to "navigate" between concepts, combining them or subtracting them. By moving the vector from one point to another in a specific direction, we can go from "cat" to "fat cat," "smart cat," "lazy cat"; or even transition between languages, maintaining the same underlying concept while the surface (the word) changes.
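
The navigation idea can be sketched with plain vector arithmetic. The embeddings below are hand-made toy values (hypothetical, chosen so the geometry is easy to follow), not vectors taken from any real model.

```python
import math

# Hypothetical toy embeddings: axis 0 ~ "cat-ness", axis 1 ~ "dog-ness", axis 2 ~ "fat".
embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "fat cat": [1.0, 0.0, 1.0],
    "fat dog": [0.0, 1.0, 1.0],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(vector, exclude=()):
    """Name of the stored concept most similar to `vector`, skipping `exclude`."""
    candidates = {k: v for k, v in embeddings.items() if k not in exclude}
    return max(candidates, key=lambda name: cosine(vector, candidates[name]))

# Extract the "fat" direction from one pair and apply it to another concept.
fat_direction = sub(embeddings["fat cat"], embeddings["cat"])
moved = add(embeddings["dog"], fat_direction)

print(nearest(moved, exclude=("dog",)))
```

In this toy space the direction learned from "cat" to "fat cat" carries over to "dog", which is the behavior the hypothesis predicts when a concept really is a shared linear direction.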

Concepts defined by differences: nothing exists in isolation

A fascinating aspect of this model is that, for the network, nothing has absolute meaning; everything is defined by its relationships with the rest of the space. The idea of "cat" does not come from an internal textual definition, but from its position in relation to "elephant," "table," "dog," "red," "hairy," "light," "heavy," and so on.

Suppose you know that an elephant is bigger and heavier than a cat, less furry, and has a different texture, and that a table is shinier than both, not furry, bigger than a cat, and smaller than an elephant. Then a structure begins to emerge: "size," "weight," "texture," "hairiness," "shine." These dimensions don't need to correspond directly to those we use in common sense, but they function as axes that organize concepts in a way that is useful for the model.

As the space becomes filled with concepts, these cross-relationships refine both the concepts themselves and the "latent dimensions." In practical terms, the more the model learns and adjusts its weights, the richer these internal representations become, allowing for increasingly subtle and contextually appropriate predictions.

It's important to remember that "size," "weight," or "hairy" are convenient metaphors. In reality, the dimensions used by AI can capture extremely complex patterns that don't fit into simple categories for humans. They can be non-trivial combinations of syntactic, semantic, visual, spatial, stylistic, and other aspects.

In a sense, this vector space constitutes an internal “world model.” It's not just an abstract concept: it's something concrete that happens today in neural networks and LLMs. When we say that a model "understands" something, what we are actually seeing is the result of that process of positioning and relating vectors in that implicit conceptual space.

From feature microscopy to large AI companies

In recent years, mechanistic interpretability has taken a leap forward thanks to new tools that function, metaphorically, as microscopes for language models. Instead of just observing inputs and outputs, researchers began to directly inspect the internal activations and specific regions of the vector space where certain concepts reside.

Companies like Anthropic, OpenAI, Google DeepMind, and projects like Neuronpedia have been leading this effort. Anthropic, for example, announced a technique dubbed a "microscope" to look inside its Claude model and identify internal features that correspond to recognizable concepts, such as Michael Jordan, the Golden Gate Bridge, or even abstract ideas like "flattery" and "digital backdoors."

Subsequently, the research progressed to tracing entire chains of features. This shows not only that a neuron or vector is associated with a concept, but also how that concept is activated, transformed, and combined across layers, from the initial prompt to the final response. This allows us, for example, to understand which parts of the model participate in a specific deceptive behavior or hallucination.

Teams from OpenAI and Google DeepMind have started using similar techniques to investigate unexpected behavior. This includes situations where models appear to be trying to deceive users in controlled tests. By connecting internal features to these behavioral patterns, it becomes possible to monitor and, in some cases, modify the model to reduce risks.

Another promising approach is what's called "chain-of-thought monitoring." In "reasoning" models, which generate explicit intermediate steps (such as justifications or partial calculations), researchers analyze this "internal monologue" to detect undesirable strategies: for example, a model that finds a way to "cheat" on a programming test using training knowledge that should be blocked.

Superposition, sparse autoencoders, and monosemantic features

One of the major obstacles to mechanistic interpretability is the so-called superposition hypothesis. In large neural networks, a single neuron or dimension rarely represents a single "clean" concept; instead, multiple concepts coexist compressed into a few dimensions, overlapping like multiple images projected onto the same plane.

This superposition makes it difficult to point to a neuron and say, "this is just concept X." Seemingly unrelated behaviors can activate the same internal components, confusing the analysis. To deal with this, a powerful tool has emerged: sparse autoencoders, applied to the internal activations of the models.
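
The interference that superposition causes can be illustrated with a deliberately tiny example: three hypothetical features squeezed into a 2-dimensional activation space, using directions 120 degrees apart. The setup is an assumption made for illustration, not a description of any real model.

```python
import math

# Three hypothetical feature directions packed into only two dimensions.
directions = [
    [math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3)]
    for k in range(3)
]

def encode(feature_values):
    """Superpose features: sum of each feature value times its direction."""
    return [
        sum(value * d[i] for value, d in zip(feature_values, directions))
        for i in range(2)
    ]

def readout(activation):
    """Project the activation back onto every feature direction."""
    return [sum(a * x for a, x in zip(activation, d)) for d in directions]

activation = encode([1.0, 0.0, 0.0])  # only feature 0 is active
scores = readout(activation)
print(scores)  # feature 0 reads about 1.0; features 1 and 2 read about -0.5
```

Although only one feature is active, the other two read out as nonzero, because the directions cannot all be orthogonal in so few dimensions. That leakage is what makes individual dimensions hard to interpret.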

Sparse autoencoders are auxiliary networks trained to re-express these tangled activations as a cleaner set of features. The idea is to encode the activations into a (typically larger) set of candidate features and then reconstruct them, while encouraging the auxiliary model to use only a few features at a time (sparsity). The result is a set of "features" closer to monosemantic representations: each feature tends to correspond to a more specific and understandable pattern.
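
A minimal sketch of the forward pass and training objective of a sparse autoencoder, assuming a ReLU encoder, a linear decoder, and an L1 sparsity penalty. This is a common formulation but only one of several variants. The weights are hand-picked toy values rather than trained ones, and all names (`sae_forward`, `sae_loss`, `l1_coeff`) are hypothetical.

```python
def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation into sparse features, then reconstruct it."""
    pre_activation = [h + b for h, b in zip(matvec(w_enc, x), b_enc)]
    features = relu(pre_activation)
    reconstruction = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, reconstruction

def sae_loss(x, features, reconstruction, l1_coeff=0.1):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    mse = sum((a - b) ** 2 for a, b in zip(x, reconstruction)) / len(x)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity

# Toy setup: a 2-dimensional activation expanded into 4 candidate features.
w_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

x = [0.5, -2.0]
features, reconstruction = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, reconstruction)
print(features, reconstruction, loss)
```

In this toy case only two of the four features are active and the reconstruction is exact, so the loss consists entirely of the sparsity term. During real training, the two terms pull against each other: better reconstruction versus fewer active features.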

Recent research shows that by applying sparse autoencoders to LLMs in production, it is possible to extract features aligned with human concepts, including across multiple languages, as well as abstract notions such as "coding error," "forced praise," "digital vulnerability," and so on. This reinforces the Linear Representation Hypothesis: many of these concepts actually behave as reasonably separable directions in vector space.

The next step is to manipulate these features to see how the model's behavior changes. By amplifying or inhibiting certain internal vectors, researchers can make a model more likely to follow safe instructions, less likely to provide dangerous content, or more accurate in responding about a given domain, all without altering the original weights, only by modulating the activations.
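
The amplify-or-inhibit idea can be sketched as adding a scaled feature direction to an activation at inference time. The vectors here are toy values and the function names are hypothetical. In a real model the edit would be applied to a chosen layer's activations during the forward pass.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steer(activation, direction, strength):
    """Return activation + strength * direction; model weights stay untouched."""
    return [a + strength * d for a, d in zip(activation, direction)]

# Hypothetical unit-length feature direction and a toy activation.
direction = [0.6, 0.8]
activation = [1.0, 1.0]

baseline = dot(activation, direction)

# Amplify the feature by pushing the activation along its direction.
amplified = steer(activation, direction, 2.0)

# Inhibit the feature by removing its current component entirely.
ablated = steer(activation, direction, -baseline)

print(baseline, dot(amplified, direction), dot(ablated, direction))
```

Because the direction has unit length, a steering strength of `s` shifts the feature's readout by exactly `s`, which makes the size of the intervention easy to reason about.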

Geospatial mechanistic interpretability

One particularly interesting application is geospatial mechanistic interpretability, which attempts to understand how LLMs represent geographic information internally. In geography, there is already a growing body of work evaluating whether models "know" where places are located, whether they can perform spatial reasoning, or answer questions about location.

What was still poorly understood was how these capabilities emerge within the model. How does the internal conceptual space organize names of cities, countries, regions, rivers, or points of interest? What kind of hidden spatial structure appears in the vectors associated with place names?

Recent research has proposed a new methodological framework: using classical spatial analysis techniques as reverse engineering tools. First, internal vectors (or features derived by sparse autoencoders) are obtained for a large number of place names. Then, spatial autocorrelation and other metrics are used to check whether specific features exhibit consistent geographic patterns.
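
The text does not name a specific statistic, so the sketch below assumes Moran's I, a standard measure of spatial autocorrelation, with binary neighbor weights based on a distance threshold. The coordinates and feature activations are invented toy data. Real analyses would use actual place coordinates and an appropriate geographic distance.

```python
import math

def morans_i(values, coords, threshold):
    """Moran's I with binary weights: 1 if two points lie within `threshold`."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    numerator = 0.0
    weight_total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= threshold:
                numerator += deviations[i] * deviations[j]
                weight_total += 1.0
    denominator = sum(d * d for d in deviations)
    return (n / weight_total) * (numerator / denominator)

# Four hypothetical places along a line, one unit apart.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]

# Feature activations that cluster in space vs. activations that alternate.
clustered = [1.0, 1.0, 0.0, 0.0]
alternating = [1.0, 0.0, 1.0, 0.0]

print(morans_i(clustered, coords, threshold=1.0))    # positive: nearby places agree
print(morans_i(alternating, coords, threshold=1.0))  # negative: nearby places differ
```

A feature whose activations give a clearly positive value across many place names is a candidate for a geographic interpretation. Values near zero suggest no spatial structure.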

The results show that certain features associated with place names exhibit strong spatial structure. In other words, geographically close points tend to share similar activations, which allows these features to be interpreted in geospatial terms: for example, as regions, climatic zones, coastal proximity, urbanization, or other latent patterns.

This type of analysis helps to understand "how the model thinks about geographic information" (taking care to avoid anthropomorphism). Instead of simply knowing that the model correctly answers questions about maps, we can see that there are structured clusters in the vector space that reflect real geographic relationships.

Relationship with philosophy, cognition, and consciousness

It's difficult to look at these high-dimensional conceptual spaces and not see parallels with philosophical discussions about mind, meaning, and consciousness. For decades, philosophers like Peter Gärdenfors have spoken of "conceptual spaces" as a way of modeling mental concepts through continuous dimensions that capture similarity.

What has changed is that, with modern neural networks, something very similar has ceased to be merely a philosophical metaphor and has become a concrete mechanism in production systems. Today, we can point to vectors, directions, and distances in an LLM and show that they correspond to relationships of meaning, translation between languages, abstractions, and even subtle patterns of behavior.

Some see this as a clue to how the human brain might represent concepts, given that a strong view in neuroscience describes the brain as a prediction machine, constantly trying to anticipate what comes next based on sensory signals and accumulated experience. In some debates, this is contrasted with stimulus-response theory, which offers another perspective on how behavior and representation can relate.

If we are predicting the world all the time, it seems reasonable to imagine that some kind of vector representation, or an equivalent, is in continuous processing. It's not that there's a "physical vector" at a specific point in the brain, but rather a dynamic pattern of activity that, in functional terms, behaves like a state in a conceptual space.

Some authors suggest that this may be related to qualia and subjective experience. When you see the color red, you're not just dealing with the wavelength of light; there's also the "idea of red" in your mind, linked to memories, emotions, and cultural context. This representation is unique to you, although it shares some common structures with other people.

What role does interpretability play in all of this?

Mechanistic interpretability does not intend to prove that AI is conscious or sentient. Most serious research makes it clear that the focus is technical: understanding computational mechanisms to improve safety, reliability, fault diagnostics, robustness, and supervision.

However, by showing how complex concepts can emerge from vectors and relations in a high-dimensional space, this area provides a foothold for theories about mental representation, meaning, and even consciousness. If a model can represent "red" richly enough to operate with this concept in various contexts, this does not make it conscious, but it forces us to refine what exactly we consider essential for a subjective experience to emerge.

From a practical standpoint, the great promise of mechanistic interpretability is to give us the tools to see what is currently invisible. Which parts of the model are involved when it hallucinates, when it follows dangerous instructions, when it demonstrates bias, or when it appears to "plan" a deceptive response?

With this type of internal map, it becomes possible to monitor models in real time, design finer control mechanisms, and, in some cases, directly edit internal features to alter behaviors. All of this is crucial in a scenario where LLMs and other AI systems are being deployed in sensitive domains, from finance to healthcare, security, and public policy.

Ultimately, understanding mechanistic interpretability means understanding how AI models construct and use their internal "model of the world." Whether navigating everyday concepts, dealing with complex geographical information, or answering seemingly simple questions in a conversation, the more we can illuminate these mechanisms, the less likely we are to be surprised by strange behaviors from systems that, despite being powerful, are still products of mathematics, data, and training, and not of some mysterious form of consciousness.
