Fraser Scott

Investigating list indexing with linear probes

2026-05-15T00:00:00+00:00

How do LLMs perform list indexing? Namely, how do they complete prompts like:

>>> nums = [2,8,1,9,7,4]
>>> nums[4]

Intuitively, models must compute the correct answer (7) somewhere, perhaps in a particular layer and token position. We can try to identify that point with linear probes.

Linear probes essentially tell us whether some piece of information is present in a model's residual stream. If a linear probe is robustly able to recover some piece of information, it is as though that information is a “variable” in the model’s working memory, ready to be used at later steps.

To start investigating this, I generated a dataset of prompts with randomized lists of length 6 and randomized indices, formatted as Python code:

Dataset size: 600
Samples per position: 100

Example prompts:
  '>>> nums = [4,3,3,7,0,0]\n>>> nums[2]\n'
    → position=2, target=3
  '>>> nums = [4,0,2,3,2,3]\n>>> nums[1]\n'
    → position=1, target=0
  '>>> nums = [4,9,5,1,6,5]\n>>> nums[5]\n'
    → position=5, target=5
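A dataset like this can be generated in a few lines. The following is a minimal sketch rather than the exact script used for the post: the function name `make_dataset`, the seed, and the dictionary keys are illustrative assumptions, while the prompt format, list length, and samples per position follow the description above.

```python
import random


def make_dataset(samples_per_position=100, list_len=6, seed=0):
    """Build prompts with a random digit list and a queried index."""
    rng = random.Random(seed)
    dataset = []
    for position in range(list_len):
        for _ in range(samples_per_position):
            nums = [rng.randint(0, 9) for _ in range(list_len)]
            list_str = ",".join(str(n) for n in nums)
            prompt = f">>> nums = [{list_str}]\n>>> nums[{position}]\n"
            dataset.append(
                {
                    "prompt": prompt,
                    "nums": nums,
                    "position": position,
                    "target": nums[position],
                }
            )
    # Shuffle so that later train/test splits are not ordered by position.
    rng.shuffle(dataset)
    return dataset


dataset = make_dataset()
print(f"Dataset size: {len(dataset)}")
```

Generating an equal number of samples per queried position keeps the probes from simply learning which index is asked about most often.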

The model gemma-3-4b-pt is able to reliably perform this task, predicting the correct value with 97% accuracy:

Overall accuracy: 97.0% (194/200)

Accuracy by queried position:
  Position 0: 100.0% (31/31)
  Position 1: 100.0% (39/39)
  Position 2: 100.0% (32/32)
  Position 3: 93.8% (30/32)
  Position 4: 97.5% (39/40)
  Position 5: 88.5% (23/26)

This is impressive; there are 10^6 possible lists, so the model is probably doing something more sophisticated than memorising every possible answer. Note its accuracy decreases slightly in the later positions, although not by much.
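For reference, a per-position breakdown like the one above can be computed from a list of (position, target, prediction) records. This is only a sketch: the function name `accuracy_by_position` is my own, and the records below are made up to exercise the code, not real model outputs.

```python
from collections import defaultdict


def accuracy_by_position(records):
    """records: iterable of (position, target, prediction) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for position, target, prediction in records:
        total[position] += 1
        correct[position] += int(prediction == target)
    return {p: (correct[p], total[p]) for p in sorted(total)}


# Made-up records purely to exercise the function (not real model outputs).
toy_records = [(0, 4, 4), (0, 7, 7), (1, 3, 3), (1, 5, 2)]
for position, (c, n) in accuracy_by_position(toy_records).items():
    print(f"Position {position}: {100 * c / n:.1f}% ({c}/{n})")
```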

Handily, the Gemma 3 tokenizer individually tokenizes each digit in the prompt, so it will be easy to compare the model’s behaviour on particular indices in the list:

Token breakdown:
  pos  0:      2 → ''
  pos  1:  22539 → '>>>'
  pos  2:  27536 → ' nums'
  pos  3:    578 → ' ='
  pos  4:    870 → ' ['
  pos  5: 236771 → '0'
  pos  6: 236764 → ','
  pos  7: 236800 → '3'
  pos  8: 236764 → ','
  pos  9: 236800 → '3'
  pos 10: 236764 → ','
  pos 11: 236770 → '1'
  pos 12: 236764 → ','
  pos 13: 236812 → '4'
  pos 14: 236764 → ','
  pos 15: 236825 → '6'
  pos 16: 236842 → ']'
  pos 17:    107 → '\n'
  pos 18:  22539 → '>>>'
  pos 19:  27536 → ' nums'
  pos 20: 236840 → '['
  pos 21: 236778 → '2'
  pos 22: 236842 → ']'
  pos 23:    107 → '\n'

Total tokens: 24

Now, we can start probing. We will focus on probing the last token position ('\n'), because it has access to the full context through attention. In theory, the residual stream at that position may encode all of the information that the model has derived from the prompt - the elements in the list, and eventually the value at the target index (i.e. nums[i]). It is quite possible, however, that this information can also be recovered from other token positions, and at even earlier layers.

To motivate the use of linear probes, it is worth pondering how the model might encode all of this information into a single residual stream. One obvious hypothesis would be that it has index-specific digit directions, i.e. there is a direction for “index 0 is a 4”, a separate direction for “index 1 is a 4”, and so on. With only 10 digits × 6 positions = 60 directions, this seems feasible. In any case, as long as the model is using linear representations, a linear probe should be able to learn a projection that extracts them.
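To make the “index-specific digit directions” hypothesis concrete, here is a toy sketch in NumPy. It is not a claim about how Gemma actually stores the list: the width `d_model = 64` is arbitrary, and the 60 directions are assumed to be exactly orthonormal (built with a QR decomposition), which is the most favourable case for linear read-out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_positions, n_digits = 64, 6, 10  # toy width, not the real model's

# 60 orthonormal directions, one per (index, digit) pair.
q, _ = np.linalg.qr(rng.standard_normal((d_model, n_positions * n_digits)))
directions = q.T.reshape(n_positions, n_digits, d_model)


def encode(nums):
    """Sum the direction for each (index, digit) pair into one vector."""
    return sum(directions[i, digit] for i, digit in enumerate(nums))


def decode(vector):
    """Read each index back with a linear projection followed by argmax."""
    scores = directions @ vector  # shape (n_positions, n_digits)
    return scores.argmax(axis=1).tolist()


nums = [2, 8, 1, 9, 7, 4]
residual = encode(nums)
print(decode(residual))  # [2, 8, 1, 9, 7, 4]
```

All six elements live in one vector, yet each can be recovered with a dot product. A linear probe is essentially learning the `directions` matrix from data instead of being handed it.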

We will first try to answer a basic question - can each element of the list be recovered from the residual stream of the final token position? If every element is present in the residual stream at all times, it becomes more nuanced to identify when the model has isolated the target element. Our ultimate goal will be to detect if the correct answer is more recoverable than the other elements.

To establish this baseline, we first train “per-index” probes for each layer. Specifically, given only the model’s activations for a single layer, the probe outputs a label from 0-9 predicting the value of the list at its index:

Despite only having access to the model's activations, the probes can learn how the model encodes information and reliably extract the value of each index. If the model does not linearly represent an element in the residual stream, then the corresponding probe will not have enough signal to achieve high accuracy. In that case the probe can do no better than a 1 in 10 chance of guessing the correct answer.

With lists of length 6 and 35 layers, there are 210 per-index probes; luckily they are cheap to train. We use our dataset to sample residual stream activations for each layer, then fit linear probes to the data with a sklearn LogisticRegression.
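As a rough sketch of what fitting these probes looks like, the snippet below trains one sklearn LogisticRegression per list index. To keep it runnable without the model, the activations are a synthetic stand-in (orthonormal digit directions plus Gaussian noise), so the accuracies it prints say nothing about Gemma. The helper name `train_probe`, the 80/20 split, and `max_iter=1000` are my assumptions, not necessarily the settings used for the post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def train_probe(activations, labels, seed=0):
    """Fit one linear probe and return it with its held-out accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        activations, labels, test_size=0.2, random_state=seed
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(x_train, y_train)
    return probe, probe.score(x_test, y_test)


# Synthetic stand-in for one layer's final-token activations.
rng = np.random.default_rng(0)
n_samples, d_model, list_len = 600, 64, 6
lists = rng.integers(0, 10, size=(n_samples, list_len))
q, _ = np.linalg.qr(rng.standard_normal((d_model, list_len * 10)))
directions = q.T.reshape(list_len, 10, d_model)
activations = directions[np.arange(list_len), lists].sum(axis=1)
activations += 0.1 * rng.standard_normal(activations.shape)

# One probe per list index; the real experiment repeats this for every layer.
accuracies = {}
for index in range(list_len):
    _, accuracies[index] = train_probe(activations, lists[:, index])
    print(f"index {index}: held-out accuracy {accuracies[index]:.2f}")
```

With real data, `activations` would be replaced by the residual stream at the final token for a given layer, and the loop would run over layers as well as indices.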

We can then plot the per-layer accuracy for each index:

The family of probes for a given index is represented by a coloured line. Each datapoint indicates how successful a probe is at recovering its list element at that layer. For example, the probe for index 1, layer 16 has an accuracy of 60%; the value of the 2nd element in the list can be recovered from the residual stream at layer 16 far more often than the 10% chance level.

There are a few trends. None of the elements fall below the chance line of 10%; they are all always present in the residual stream. The model does indeed move information about each element to the final token position.

The value of the 1st element in the list is always prominent in the residual stream, never dipping below 70% accuracy. This element might be treated specially by the model; arr[0] is a common pattern. The 2nd and 6th values are initially amplified, but gradually decrease between layers 13 and 20. The remaining elements hover around the 35-60% range.

With this established, we can now try to identify when the model isolates the target digit. That is, out of all the elements present in the residual stream, which is the “correct” one to output based on nums[i]? To do this, we train a “target probe” for each layer that, given an activation, learns to output the value of the list at the target index:
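The only difference between a per-index probe and the target probe is the label each activation is paired with. A small sketch using the three example prompts from the dataset section (the variable names are illustrative):

```python
import numpy as np

# The three example prompts shown earlier: list contents and queried index.
lists = np.array(
    [
        [4, 3, 3, 7, 0, 0],
        [4, 0, 2, 3, 2, 3],
        [4, 9, 5, 1, 6, 5],
    ]
)
positions = np.array([2, 1, 5])

# Per-index probe for index 0: the label always comes from the same slot.
index0_labels = lists[:, 0]

# Target probe: the label comes from whichever slot the prompt queries.
target_labels = lists[np.arange(len(lists)), positions]

print(index0_labels.tolist())  # [4, 4, 4]
print(target_labels.tolist())  # [3, 0, 5]
```

Because the queried index varies from prompt to prompt, the target probe can only succeed if the activation encodes the result of the lookup, not just the list contents.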

With these probes trained, we can overlay their accuracy onto the per-index diagram:

We see a clear trend - an almost inverse relationship between the linear recoverability of the target digit and the other elements. At layer 13, the accuracy of the target digit shoots from 30% to 85%, and asymptotes to nearly 100% in the subsequent layers; meanwhile the accuracy of the other digits gradually levels off.

To answer our initial question, it would appear that the model identifies the correct answer on or before layer 13. The correct answer is amplified heavily, and the model starts “ignoring” the other elements, which are slowly overwritten in the residual stream.

Of course, while this graph offers an intriguing picture, there are a few caveats to this approach. High probe accuracy does not imply that the model uses that information causally - it may be a byproduct of the computation rather than an input to it. Verifying this would require causal interventions like activation patching. In addition, we only probed the final token position; the model may perform key computations at other positions earlier in the forward pass.

More investigation is needed…

Injecting country features in Gemma2-2b

2026-04-24T00:00:00+00:00

Circuit tracing lets us identify interpretable features in language models. But if we find a feature relating to some concept, does it activate on all prompts involving that concept? And do prompts involving different concepts use analogous features in the same role?

In this post, I identify country-specific features that consistently activate when a model is prompted about cities in a given country. I then test whether these features are interchangeable by injecting the feature for one country into a prompt about another.

If we ask Gemma2-2b to predict the currency of Manchester, but amplify the India feature while negating the UK feature, it answers Indian Rupee!

To see the full experiment, read on.

The tools
The dataset
The country features
Feature injection
Conclusion

The tools

For this experiment, I used the circuit-tracer library in a Google Colab notebook, with a T4 GPU. You can follow along with my notebook here.

I also used the Circuit Tracer in Neuronpedia to manually identify features.

The dataset

In this task, we want the model to predict the currency of the country containing a given city. I chose the following multi-shot prompt (the clear format elicits better performance):

prompt_template = "Paris:EUR :: Tokyo:JPY :: {city}:"

The model is generally quite good at this task and predicts the correct ticker for a range of cities.

To be confident that the feature injection experiment would work, I chose India and the United Kingdom, two well-represented countries that the model likely has strong, distinct features for. I listed six cities for each country:

# Define dataset
countries = {
    "India": {
        "ticker": "INR",
        "cities": ["Mumbai", "Delhi", "Chennai", "Kolkata", "Bangalore", "Hyderabad"],
    },
    "United Kingdom": {
        "ticker": "GBP",
        "cities": ["London", "Manchester", "Birmingham", "Edinburgh", "Liverpool", "Bristol"],
    },
}
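Combining the template with the dataset gives one prompt per city, along with the ticker we expect the model to predict. This is a small sketch; the helper name `build_prompts` and the row keys are my own choices for illustration.

```python
prompt_template = "Paris:EUR :: Tokyo:JPY :: {city}:"

countries = {
    "India": {
        "ticker": "INR",
        "cities": ["Mumbai", "Delhi", "Chennai", "Kolkata", "Bangalore", "Hyderabad"],
    },
    "United Kingdom": {
        "ticker": "GBP",
        "cities": ["London", "Manchester", "Birmingham", "Edinburgh", "Liverpool", "Bristol"],
    },
}


def build_prompts(countries, template):
    """Return one row per city with its prompt and expected ticker."""
    rows = []
    for country, info in countries.items():
        for city in info["cities"]:
            rows.append(
                {
                    "country": country,
                    "city": city,
                    "prompt": template.format(city=city),
                    "expected": info["ticker"],
                }
            )
    return rows


rows = build_prompts(countries, prompt_template)
for row in rows[:2]:
    print(repr(row["prompt"]), "->", row["expected"])
```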

It is important to verify that the model can actually do this task; that its highest confidence prediction for each city is the correct ticker:

As expected, the model robustly predicts the correct token. This is impressive for a small model, and may suggest that it is doing something more sophisticated than just memorising every possible city/currency pair.

The country features

This experiment assumes that the model has features representing individual countries, and that these are somehow used to output the correct answer. If we can find these features and ablate them, the model should output the wrong currency. In my previous post, I identified these features for a different country, so I was confident more would exist.

To find these features, I simply generated attribution graphs for a city in each country (Manchester and Mumbai) on Neuronpedia. Then I manually noted down any features with the name of the country in their automated label, focusing on the later layers of the final token position.

For example, the features for India (in black) can be seen here in the top-right:

This surfaced 4 features for India and 3 for the UK.

# Define features
india_features = [
    Feature(layer=20, pos=None, feature_idx=9559),   # Indian
    Feature(layer=21, pos=None, feature_idx=607),     # India-related business texts
    Feature(layer=20, pos=None, feature_idx=8828),    # India
    Feature(layer=18, pos=None, feature_idx=12013),   # India
]

uk_features = [
    Feature(layer=20, pos=None, feature_idx=3744),    # UK
    Feature(layer=18, pos=None, feature_idx=10922),   # England/Britain
    Feature(layer=23, pos=None, feature_idx=5510),    # United Kingdom / United States
]

Then, I verified that these candidate features only activated on the cities of that country:

As expected, the supernode for a given country (the group of features listed above, treated as a single unit) activates strongly when the model is asked about that country, but has zero activation when asked about others.

Similarly, I tested that negating each supernode “breaks” the model’s output on cities in that country.

When each country’s supernode is negated, the model instead answers USD (or EUR) for that country!

These features clearly play a role. In general, it is promising that the model appears to have analogous representations of individual countries in the final layers.

Feature injection

We can now attempt to “inject” the activations for one country’s features into the prompt for the other. If these two supernodes do indeed serve the same role in this currency circuit, but for different countries, then we can use them interchangeably.

Of course, the India supernode does not naturally activate on UK prompts. To address this, we first run a prompt for an Indian city, e.g. Mumbai and capture all the activations from that forward pass. Then we can extract the India supernode activations specifically and inject them into other prompts.

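# Note: `graph`, `source_graph`, `Supernode`, `Intervention` and
# `supernode_intervention` are assumed to be defined earlier in the notebook.
# Run the Mumbai prompt to capture India feature activations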
source_prompt = "Paris:EUR :: Tokyo:JPY :: Mumbai:"
_, source_activations = model.get_activations(source_prompt)

# Run the Manchester prompt to capture UK feature activations
prompt = "Paris:EUR :: Tokyo:JPY :: Manchester:"
logits, activations = model.get_activations(prompt)

# Build the UK node from the Manchester prompt (naturally active)
uk_node = Supernode(name="United Kingdom", features=uk_features)
graph.initialize_node(uk_node, activations)

# Build the India node using activations sourced from Mumbai
india_node = Supernode(name="India", features=india_features)
india_source = Supernode(name="India_source", features=india_features)
source_graph.initialize_node(india_source, source_activations)
india_node.default_activations = india_source.default_activations

# Intervene: suppress UK, amplify India
supernode_intervention(graph, [Intervention(uk_node, -2), Intervention(india_node, 2)])

We can visualise this for a specific prompt. First, without any interventions, the Manchester embedding activates the United Kingdom supernode, and the model predicts GBP:

However, if we negate the United Kingdom supernode and inject the India supernode, the model predicts INR:

Note that we use a scaling factor of -2x rather than simply zeroing out the UK features. In theory, 0x would remove the UK signal, but in practice I found that -2x produced cleaner results than plain ablation.
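One way to see why a negative factor can behave differently from ablation is a toy calculation. This sketch assumes the feature's direct effect on a logit is simply activation × weight, which ignores everything downstream; the function name and the numbers are hypothetical and are not measurements from the model.

```python
def scaled_contribution(activation, output_weight, factor):
    """Direct contribution of one feature to a logit after scaling it."""
    return factor * activation * output_weight


activation = 3.0  # hypothetical UK feature activation
output_weight = 0.5  # hypothetical weight from that feature to the GBP logit

for factor in (1, 0, -2):
    contribution = scaled_contribution(activation, output_weight, factor)
    print(f"factor {factor:>2}: GBP logit contribution {contribution:+.1f}")
```

Under this simplification, 0x only removes the feature's push towards GBP, while -2x actively pushes against it. That may help override other UK-related signals that the three listed features do not capture.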

With this procedure established, we can inject “India” into all the UK prompts and vice versa:

As hoped, the model predictions completely changed - GBP became the top prediction for Indian cities and INR for the United Kingdom! In addition, each country’s own currency is completely absent from its predictions due to the negative steering.

Conclusion

This experiment provides a neat interpretability result. Not only does the model have distinct features for particular countries, but those features can be swapped for one another in the same circuit, at least for these two countries.

To expand on this experiment, I would test it on more countries. My initial tests showed that injection did not work on less well-represented countries - it is possible that the features do not exist or the computation is more distributed.

In addition, I would like to try automatically discovering features by comparing their activations on various inputs. I am also keen to establish a rigorous “end to end” understanding of this currency circuit or another format of prompt, beyond looking at single features.

How does Qwen3-4B recall currencies?

2026-04-20T00:00:00+00:00

After spending the last year designing tasks that are hard for AI, I have become interested in the question of how LLMs think, specifically how they are able to solve novel problems.

To start to answer this question, I looked to the obvious place - Mechanistic Interpretability. A key idea there is circuit tracing, that models have “features” which are used in “circuits” to predict the output.

As a first go, I investigated how a model answers a decidedly non-novel prompt:

Fact: the currency ticker of the country containing Medellín is

When Qwen3-4B completes this prompt, it correctly answers “COP”. But not only that - it has internal representations of currencies, countries, Colombia, The Americas, and more!

If you are curious about this graph, read on.

Circuit tracing
Choosing the prompt
Features
- Input features
- Output features
  - The “say uppercase text” feature
  - The “say a geographic location” feature
The attribution graph
Conclusion

Circuit tracing

I followed the circuit tracing approach outlined by Anthropic here. In brief, they train a “replacement model” which approximates the output of an LLM, but with more interpretable components.

The tool I used is the Neuronpedia Circuit Tracer. For the unfamiliar, this tutorial gives a good introduction to the application and circuits in general.

The basic hypothesis of circuits is intuitive; models are probably learning various concepts and synthesising them in learned algorithms. If we can identify these features, we can see when they activate and what happens when they are ablated.

Choosing the prompt

I wanted a prompt that requires multi-step reasoning but has an unambiguous answer. I chose the following:

Fact: the currency ticker of the country containing Medellín is

To complete this prompt, the model must recall a) which country Medellín is in, and b) its currency ticker. Hopefully, this reasoning chain is reflected as a circuit in the model. This format is common in circuit tracing, so I was confident it would work.

The model correctly predicts “COP” with 81% probability.

I chose Qwen3-4B because it was the biggest model for which steering (i.e. ablating features) was available.

As for the formatting, the “Fact:” prefix just promotes a succinct response from the model. Also, Qwen3-4B is instruction-tuned, so the full prompt follows a user/assistant format. Conceptually the model is predicting the next token that a user would say.

Features

With the setup complete, Neuronpedia generates an attribution graph for the prompt.

That’s a lot! The input tokens run along the bottom, and the next token predictions are at the top right. The features take up the middle, separated by layer.

A useful heuristic is to try to identify two types of features:

Input features tend to appear in earlier layers and activate on particular input tokens. They are like the “source” of some information flowing through the network.

Output features tend to activate in later layers. They activate in response to input features and influence the next token that the model will predict. For this reason they are often denoted by “say X”.

For example, take the prompt:

The national dish of Italy is

Imagine that a “food” input feature activates on the token “dish”. If this is ablated, the model will kind of forget that the question is about food, and give another answer, like the national animal.

Suppose also that there is a “say a food” output feature. If this is ablated, the model will not immediately say the answer, but give an indirect completion like “…is called pizza”. It does not forget the question (the “food” feature is still present), but its response is temporarily suppressed.

In practice, each feature is an input and output to many others, but this heuristic gets us quite far.

Input features

We can find input features by inspecting the features activated by each token in the prompt. Handily, Neuronpedia provides automated labels for features based on their activations.

Here are a few that I found.

The “Colombia” feature

Starting with the “ín” token (i.e. the last token of Medellín), there are a number of features activated at this position that reference Colombia.

Interestingly, these features (in pink) appear in the middle layers of the network. Perhaps it takes a few layers for attention to mix the three tokens of “Med/ell/ín” into a single concept.

Looking at their top activations, these features activate on places and concepts related to Colombia.

If all these features are ablated, the model outputs “MXN”!

There is clearly still Latin American influence in the answer. Well, if the “Americas” feature is also ablated, the model outputs “ZM” (Zambia)!

The “country” feature

Looking at the token “country”, there are a number of features labelled “country”, “nation”, etc.

If these are ablated, the model answers “USD”. This intervention probably discourages the model from answering for a specific country, so it just outputs a generic currency.

The “currency ticker” feature

For the token “ticker”, there are various features labelled “currency”, “codes”, etc.

As would be expected, these activate on financial terms and codes.

If these features are ablated, the model gives a strange completion counting to 9. I’m not sure how to interpret this, but this feature is clearly important to the output.

Output features

To find output features, we can see which features influenced the model’s prediction.

These are more difficult to interpret as there is no dedicated “say COP” feature. Also, a number of the activations for this prompt come from error nodes, which capture the mismatch between the replacement model’s output and the underlying model; that part of the computation cannot be attributed to any feature.

The “say uppercase text” feature

Looking at the “COP” prediction, one of the top inputs is an “uppercase text” feature which activates on capital letters.

When this feature is ablated, the model gives a longer answer starting with “Colombian pesos”.

This feature probably makes the model output a ticker specifically, instead of the full name of the currency. By intervening, we essentially force the model to reach the answer in an indirect way.

The “say a geographic location” feature

Another top feature is labelled “geographic locations”. It activates mostly when completing Latin American place names.

If this is ablated, the completion is “USD”. It seems that this feature, perhaps working in concert with others, influences the model to answer about Colombia specifically.

The attribution graph

Now that we have the important features, they can be visualised in a subgraph.

So, how does this model complete this prompt?

My interpretation is that the prompt activates features for a specific country (here, Colombia), and for the concepts of currency tickers and countries in general. These then activate output features which separately make the model a) say the currency of a specific country and b) format it as a ticker.

Of course, this is not a rigorous analysis. There are likely many more features that play a role, and it is unclear exactly how they interact.

You can also try changing elements of the prompt. For example, when I changed the city name to “Cartagena”, I found many of the same features were activated.

Conclusion

Next, I’ll analyse a more complex prompt much more rigorously.

If you are curious about mechinterp, Anthropic’s Towards Monosemanticity, Scaling Monosemanticity, and Circuit Tracing form a nice “trilogy” on this circuit tracing approach. Plus Neel Nanda’s How To Become A Mechanistic Interpretability Researcher is a good starting place more broadly.