<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://yonadaa.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://yonadaa.github.io/" rel="alternate" type="text/html" /><updated>2026-05-15T12:33:34+00:00</updated><id>https://yonadaa.github.io/feed.xml</id><title type="html">Fraser Scott</title><subtitle>A blog</subtitle><entry><title type="html">Investigating list indexing with linear probes</title><link href="https://yonadaa.github.io/2026/05/15/investigating-list-indexing-with-linear-probes.html" rel="alternate" type="text/html" title="Investigating list indexing with linear probes" /><published>2026-05-15T00:00:00+00:00</published><updated>2026-05-15T00:00:00+00:00</updated><id>https://yonadaa.github.io/2026/05/15/investigating-list-indexing-with-linear-probes</id><content type="html" xml:base="https://yonadaa.github.io/2026/05/15/investigating-list-indexing-with-linear-probes.html"><![CDATA[<p><a href="https://colab.research.google.com/github/yonadaa/list-indexing-linear-probes/blob/main/list-indexing-linear-probes.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></p>

<p>How do LLMs perform list indexing? Namely, how do they complete prompts like:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="n">nums</span> <span class="o">=</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">8</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">9</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">4</span><span class="p">]</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">nums</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span>

</code></pre></div></div>

<p>Intuitively, models must compute the correct answer (7) somewhere, perhaps in a particular layer and token position. We can try to identify that point with <strong>linear probes</strong>.</p>

<p>Linear probes essentially tell us whether some piece of information is present in a model’s residual stream. If a linear probe can robustly recover some piece of information, it is as though that information is a “variable” in the model’s working memory, ready to be used at later steps.</p>

<p>To start investigating this, I generated a dataset of prompts with randomized lists of length 6 and randomized indices, formatted as Python code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Dataset</span> <span class="n">size</span><span class="p">:</span> <span class="mi">600</span>
<span class="n">Samples</span> <span class="n">per</span> <span class="n">position</span><span class="p">:</span> <span class="mi">100</span>

<span class="n">Example</span> <span class="n">prompts</span><span class="p">:</span>
  <span class="s">'&gt;&gt;&gt; nums = [4,3,3,7,0,0]</span><span class="se">\n</span><span class="s">&gt;&gt;&gt; nums[2]</span><span class="se">\n</span><span class="s">'</span>
    <span class="err">→</span> <span class="n">position</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">target</span><span class="o">=</span><span class="mi">3</span>
  <span class="s">'&gt;&gt;&gt; nums = [4,0,2,3,2,3]</span><span class="se">\n</span><span class="s">&gt;&gt;&gt; nums[1]</span><span class="se">\n</span><span class="s">'</span>
    <span class="err">→</span> <span class="n">position</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">target</span><span class="o">=</span><span class="mi">0</span>
  <span class="s">'&gt;&gt;&gt; nums = [4,9,5,1,6,5]</span><span class="se">\n</span><span class="s">&gt;&gt;&gt; nums[5]</span><span class="se">\n</span><span class="s">'</span>
    <span class="err">→</span> <span class="n">position</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">target</span><span class="o">=</span><span class="mi">5</span>
</code></pre></div></div>
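<p>A dataset like this could be generated roughly as follows. This is a minimal sketch rather than the notebook's exact code: the names <code class="language-plaintext highlighter-rouge">make_prompt</code> and <code class="language-plaintext highlighter-rouge">make_dataset</code>, the seed, and the uniform sampling of digits are all assumptions; only the prompt format, the list length of 6, and the 100 samples per queried position come from the output above.</p>

```python
import random

LIST_LENGTH = 6  # list length used in the post


def make_prompt(nums, index):
    """Format a list and an index query as a Python REPL session."""
    digits = ",".join(str(n) for n in nums)
    return f">>> nums = [{digits}]\n>>> nums[{index}]\n"


def make_dataset(samples_per_position=100, seed=0):
    """Build a balanced dataset: the same number of prompts per queried index.

    Digits are drawn uniformly from 0-9 (an assumption about the sampling).
    """
    rng = random.Random(seed)
    dataset = []
    for position in range(LIST_LENGTH):
        for _ in range(samples_per_position):
            nums = [rng.randrange(10) for _ in range(LIST_LENGTH)]
            dataset.append(
                {
                    "prompt": make_prompt(nums, position),
                    "nums": nums,
                    "position": position,
                    "target": nums[position],
                }
            )
    rng.shuffle(dataset)
    return dataset


dataset = make_dataset()
print("Dataset size:", len(dataset))
```

<p>Balancing the queried positions keeps any one index from dominating the probes trained later.</p>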

<p>The model <code class="language-plaintext highlighter-rouge">gemma-3-4b-pt</code> is able to reliably perform this task, predicting the correct value with 97% accuracy:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Overall</span> <span class="n">accuracy</span><span class="p">:</span> <span class="mf">97.0</span><span class="o">%</span> <span class="p">(</span><span class="mi">194</span><span class="o">/</span><span class="mi">200</span><span class="p">)</span>

<span class="n">Accuracy</span> <span class="n">by</span> <span class="n">queried</span> <span class="n">position</span><span class="p">:</span>
  <span class="n">Position</span> <span class="mi">0</span><span class="p">:</span> <span class="mf">100.0</span><span class="o">%</span> <span class="p">(</span><span class="mi">31</span><span class="o">/</span><span class="mi">31</span><span class="p">)</span>
  <span class="n">Position</span> <span class="mi">1</span><span class="p">:</span> <span class="mf">100.0</span><span class="o">%</span> <span class="p">(</span><span class="mi">39</span><span class="o">/</span><span class="mi">39</span><span class="p">)</span>
  <span class="n">Position</span> <span class="mi">2</span><span class="p">:</span> <span class="mf">100.0</span><span class="o">%</span> <span class="p">(</span><span class="mi">32</span><span class="o">/</span><span class="mi">32</span><span class="p">)</span>
  <span class="n">Position</span> <span class="mi">3</span><span class="p">:</span> <span class="mf">93.8</span><span class="o">%</span> <span class="p">(</span><span class="mi">30</span><span class="o">/</span><span class="mi">32</span><span class="p">)</span>
  <span class="n">Position</span> <span class="mi">4</span><span class="p">:</span> <span class="mf">97.5</span><span class="o">%</span> <span class="p">(</span><span class="mi">39</span><span class="o">/</span><span class="mi">40</span><span class="p">)</span>
  <span class="n">Position</span> <span class="mi">5</span><span class="p">:</span> <span class="mf">88.5</span><span class="o">%</span> <span class="p">(</span><span class="mi">23</span><span class="o">/</span><span class="mi">26</span><span class="p">)</span>
</code></pre></div></div>

<p>This is impressive; there are <code class="language-plaintext highlighter-rouge">10^6</code> possible lists, so the model is probably doing something more sophisticated than memorising every possible answer. Note that its accuracy decreases slightly at the later positions, although not by much.</p>

<p>Handily, the Gemma 3 tokenizer individually tokenizes each digit in the prompt, so it will be easy to compare the model’s behaviour on particular indices in the list:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Token</span> <span class="n">breakdown</span><span class="p">:</span>
  <span class="n">pos</span>  <span class="mi">0</span><span class="p">:</span>      <span class="mi">2</span> <span class="err">→</span> <span class="s">'&lt;bos&gt;'</span>
  <span class="n">pos</span>  <span class="mi">1</span><span class="p">:</span>  <span class="mi">22539</span> <span class="err">→</span> <span class="s">'&gt;&gt;&gt;'</span>
  <span class="n">pos</span>  <span class="mi">2</span><span class="p">:</span>  <span class="mi">27536</span> <span class="err">→</span> <span class="s">' nums'</span>
  <span class="n">pos</span>  <span class="mi">3</span><span class="p">:</span>    <span class="mi">578</span> <span class="err">→</span> <span class="s">' ='</span>
  <span class="n">pos</span>  <span class="mi">4</span><span class="p">:</span>    <span class="mi">870</span> <span class="err">→</span> <span class="s">' ['</span>
  <span class="n">pos</span>  <span class="mi">5</span><span class="p">:</span> <span class="mi">236771</span> <span class="err">→</span> <span class="s">'0'</span>
  <span class="n">pos</span>  <span class="mi">6</span><span class="p">:</span> <span class="mi">236764</span> <span class="err">→</span> <span class="s">','</span>
  <span class="n">pos</span>  <span class="mi">7</span><span class="p">:</span> <span class="mi">236800</span> <span class="err">→</span> <span class="s">'3'</span>
  <span class="n">pos</span>  <span class="mi">8</span><span class="p">:</span> <span class="mi">236764</span> <span class="err">→</span> <span class="s">','</span>
  <span class="n">pos</span>  <span class="mi">9</span><span class="p">:</span> <span class="mi">236800</span> <span class="err">→</span> <span class="s">'3'</span>
  <span class="n">pos</span> <span class="mi">10</span><span class="p">:</span> <span class="mi">236764</span> <span class="err">→</span> <span class="s">','</span>
  <span class="n">pos</span> <span class="mi">11</span><span class="p">:</span> <span class="mi">236770</span> <span class="err">→</span> <span class="s">'1'</span>
  <span class="n">pos</span> <span class="mi">12</span><span class="p">:</span> <span class="mi">236764</span> <span class="err">→</span> <span class="s">','</span>
  <span class="n">pos</span> <span class="mi">13</span><span class="p">:</span> <span class="mi">236812</span> <span class="err">→</span> <span class="s">'4'</span>
  <span class="n">pos</span> <span class="mi">14</span><span class="p">:</span> <span class="mi">236764</span> <span class="err">→</span> <span class="s">','</span>
  <span class="n">pos</span> <span class="mi">15</span><span class="p">:</span> <span class="mi">236825</span> <span class="err">→</span> <span class="s">'6'</span>
  <span class="n">pos</span> <span class="mi">16</span><span class="p">:</span> <span class="mi">236842</span> <span class="err">→</span> <span class="s">']'</span>
  <span class="n">pos</span> <span class="mi">17</span><span class="p">:</span>    <span class="mi">107</span> <span class="err">→</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span>
  <span class="n">pos</span> <span class="mi">18</span><span class="p">:</span>  <span class="mi">22539</span> <span class="err">→</span> <span class="s">'&gt;&gt;&gt;'</span>
  <span class="n">pos</span> <span class="mi">19</span><span class="p">:</span>  <span class="mi">27536</span> <span class="err">→</span> <span class="s">' nums'</span>
  <span class="n">pos</span> <span class="mi">20</span><span class="p">:</span> <span class="mi">236840</span> <span class="err">→</span> <span class="s">'['</span>
  <span class="n">pos</span> <span class="mi">21</span><span class="p">:</span> <span class="mi">236778</span> <span class="err">→</span> <span class="s">'2'</span>
  <span class="n">pos</span> <span class="mi">22</span><span class="p">:</span> <span class="mi">236842</span> <span class="err">→</span> <span class="s">']'</span>
  <span class="n">pos</span> <span class="mi">23</span><span class="p">:</span>    <span class="mi">107</span> <span class="err">→</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span>

<span class="n">Total</span> <span class="n">tokens</span><span class="p">:</span> <span class="mi">24</span>
</code></pre></div></div>

<p>Now, we can start probing. We will focus on the last token position (<code class="language-plaintext highlighter-rouge">'\n'</code>), because it has access to the full context through attention. In theory, the residual stream at that position may encode all of the information that the model has derived from the prompt: the elements in the list, and eventually the value at the target index (i.e. <code class="language-plaintext highlighter-rouge">nums[i]</code>). It is quite possible, however, that this information can also be recovered from other token positions, even at earlier layers.</p>

<p>To motivate the use of linear probes, it is worth pondering how the model might encode all of this information into a single residual stream. One obvious hypothesis would be that it has index-specific digit directions, i.e. there is a direction for “index 0 is a 4”, a separate direction for “index 1 is a 4”, and so on. With only 10 digits × 6 positions = 60 directions, this seems feasible. In any case, as long as the model is using <a href="https://arxiv.org/abs/2311.03658">linear representations</a>, linear probes will find the best projection to extract the information.</p>

<p>We will first try to answer a basic question - can each element of the list be recovered from the residual stream of the final token position? If every element is present in the residual stream at all times, it becomes more nuanced to identify when the model has isolated the target element. Our ultimate goal will be to detect if the correct answer is <em>more</em> recoverable than the other elements.</p>

<p>To establish this baseline, we first train “per-index” probes for each layer. Specifically, given only the model’s activations for a single layer, the probe outputs a label from 0-9 predicting the value of the list at its index:</p>

<p><img src="/assets/linear_probe_diagram.svg" alt="linear_probe_diagram" /></p>

<p>Despite only having access to the model’s activations, the probes can learn how the model encodes information and reliably extract the value of each index. If the model does not linearly represent an element in the residual stream, then the corresponding probe will not have enough signal to achieve high accuracy. In that case the probe can do no better than guessing, which is correct 1 time in 10.</p>

<p>With lists of length 6 and 35 layers, there are 210 per-index probes; luckily they are cheap to train. We use our dataset to sample residual stream activations for each layer, then fit linear probes to the data with a sklearn <code class="language-plaintext highlighter-rouge">LogisticRegression</code>.</p>
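<p>The fitting step for a single probe can be sketched as below. To keep the example self-contained, it uses synthetic activations in place of real residual stream activations: each digit is assigned a random direction, and Gaussian noise is added. The hidden size of 64, the 75/25 train/test split, and the helper name <code class="language-plaintext highlighter-rouge">fit_probe</code> are assumptions for illustration, not details from the notebook.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_probe(activations, labels, seed=0):
    """Fit a linear probe and return its accuracy on held-out examples."""
    X_train, X_test, y_train, y_test = train_test_split(
        activations, labels, test_size=0.25, random_state=seed
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    return probe.score(X_test, y_test)


rng = np.random.default_rng(0)
n_samples, d_model = 600, 64  # d_model is a toy value, not Gemma's hidden size

# Labels play the role of "the digit stored at one list index".
labels = rng.integers(0, 10, size=n_samples)

# Synthetic stand-in for one layer: each digit has its own direction, plus noise.
digit_directions = rng.normal(size=(10, d_model))
noise = rng.normal(size=(n_samples, d_model))
encoded = digit_directions[labels] + noise

print("digit linearly encoded:", fit_probe(encoded, labels))
print("digit absent (noise only):", fit_probe(noise, labels))
```

<p>In this toy setup, the probe on the encoded activations should score far above chance, while the noise-only probe should stay near 10%. Repeating the fit for every (index, layer) pair would give the 210 probes described above.</p>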

<p>We can then plot the per-layer accuracy for each index:</p>

<p><img src="/assets/graph-positions.png" alt="graph-positions" /></p>

<p>The family of probes for a given index is represented by a coloured line. Each datapoint indicates how successful a probe is at recovering its list element at that layer. For example, the probe for index 1, layer 16 has an accuracy of 60%; the value of the 2nd element in the list can be recovered from the residual stream at layer 16 far more often than the 10% chance level.</p>

<p>There are a few trends. None of the probes falls below the chance line of 10%; every element is recoverable to some degree at every layer. The model does indeed move information about each element to the final token position.</p>

<p>The value of the 1st element in the list is always prominent in the residual stream, never dipping below 70% accuracy. This element might be treated specially by the model; <code class="language-plaintext highlighter-rouge">arr[0]</code> is a common pattern. The 2nd and 6th values are initially amplified, but gradually decrease between layers 13 and 20. The remaining elements hover around the 35-60% range.</p>

<p>With this established, we can now try to identify when the model isolates the <em>target</em> digit. That is, out of all the elements present in the residual stream, which is the “correct” one to output based on <code class="language-plaintext highlighter-rouge">nums[i]</code>? To do this, we train a “target probe” for each layer that, given an activation, learns to output the value of the list at the target index:</p>

<p><img src="/assets/target_probe_diagram.svg" alt="target_probe_diagram" /></p>
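<p>The only difference between the two probe families is the label each activation is paired with. A small sketch of that labelling, using the three example prompts shown earlier (the dictionary layout and variable names are hypothetical):</p>

```python
# Each example stores the list and the index that the prompt queries.
examples = [
    {"nums": [4, 3, 3, 7, 0, 0], "position": 2},
    {"nums": [4, 0, 2, 3, 2, 3], "position": 1},
    {"nums": [4, 9, 5, 1, 6, 5], "position": 5},
]

# Per-index probes: probe i is always trained on the digit stored at index i,
# regardless of which index the prompt asks about.
per_index_labels = {
    i: [example["nums"][i] for example in examples] for i in range(6)
}

# Target probe: trained on the digit at whichever index the prompt queries.
target_labels = [example["nums"][example["position"]] for example in examples]

print(per_index_labels[0])  # digits at index 0 of each list
print(target_labels)        # digits the model should output
```

<p>Because the queried index varies between prompts, the target probe can only do well if the activation reflects which element was asked for, not just which elements are in the list.</p>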

<p>With these probes trained, we can overlay their accuracy onto the per-index diagram:</p>

<p><img src="/assets/graph-target.png" alt="graph-target" /></p>

<p>We see a clear trend - an almost inverse relationship between the linear recoverability of the target digit and the other elements. At layer 13, the accuracy of the target digit shoots from 30% to 85%, and asymptotes to nearly 100% in the subsequent layers; meanwhile the accuracy of the other digits gradually levels off.</p>

<p>To answer our initial question, it would appear that the model identifies the correct answer on or before layer 13. The correct answer is amplified heavily, and the model starts “ignoring” the other elements, which are slowly overwritten in the residual stream.</p>

<p>Of course, while this graph offers an intriguing picture, there are a few caveats to this approach. High probe accuracy does not imply that the model uses that information causally; it may be a byproduct of the computation rather than an input to it. Verifying this would require causal interventions like activation patching. In addition, we only probed the final token position; the model may perform key computations at other positions earlier in the forward pass.</p>

<p>More investigation is needed…</p>]]></content><author><name></name></author><summary type="html"><![CDATA[How do LLMs perform list indexing? Namely, how do they complete prompts like:]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://yonadaa.github.io/assets/graph-target.png" /><media:content medium="image" url="https://yonadaa.github.io/assets/graph-target.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Injecting country features in Gemma2-2b</title><link href="https://yonadaa.github.io/2026/04/24/injecting-country-features-in-gemma2-2b.html" rel="alternate" type="text/html" title="Injecting country features in Gemma2-2b" /><published>2026-04-24T00:00:00+00:00</published><updated>2026-04-24T00:00:00+00:00</updated><id>https://yonadaa.github.io/2026/04/24/injecting-country-features-in-gemma2-2b</id><content type="html" xml:base="https://yonadaa.github.io/2026/04/24/injecting-country-features-in-gemma2-2b.html"><![CDATA[<p><a href="https://colab.research.google.com/github/yonadaa/currency-circuit-tracing/blob/main/currency_circuit_tracing.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></p>

<p>Circuit tracing lets us identify interpretable features in language models. But if we find a feature relating to some concept, does it activate on all prompts involving that concept? And do prompts involving different concepts use analogous features in the same role?</p>

<p>In this post, I identify country-specific features that consistently activate when a model is prompted about cities in a given country. I then test whether these features are interchangeable by <a href="https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-tracing-swap">injecting</a> the feature for one country into a prompt about another.</p>

<p>If we ask Gemma2-2b to predict the currency of Manchester, but amplify the India feature while negating the UK feature, it answers Indian Rupee!</p>

<p><img src="/assets/currency-graph-ablation.png" alt="currency-graph-ablation" /></p>

<p>To see the full experiment, read on.</p>

<ul id="markdown-toc">
  <li><a href="#the-tools" id="markdown-toc-the-tools">The tools</a></li>
  <li><a href="#the-dataset" id="markdown-toc-the-dataset">The dataset</a></li>
  <li><a href="#the-country-features" id="markdown-toc-the-country-features">The country features</a></li>
  <li><a href="#feature-injection" id="markdown-toc-feature-injection">Feature injection</a></li>
  <li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
</ul>

<h1 id="the-tools">The tools</h1>

<p>For this experiment, I used the <a href="https://github.com/decoderesearch/circuit-tracer">circuit-tracer</a> library in a Google Colab notebook, with a T4 GPU. You can follow along with my notebook <a href="https://colab.research.google.com/github/yonadaa/currency-circuit-tracing/blob/main/currency_circuit_tracing.ipynb">here</a>.</p>

<p>I also used the <a href="https://www.neuronpedia.org/gemma-2-2b/graph">Circuit Tracer</a> in Neuronpedia to manually identify features.</p>

<h1 id="the-dataset">The dataset</h1>

<p>In this task, we want the model to predict the currency of the country containing a given city. I chose the following multi-shot prompt (the clear format elicits better performance):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prompt_template = "Paris:EUR :: Tokyo:JPY :: {city}:"
</code></pre></div></div>

<p>The model is generally quite good at this task and predicts the correct ticker for a range of cities.</p>

<p>To be confident that the feature injection experiment would work, I chose <strong>India</strong> and the <strong>United Kingdom</strong>, two well-represented countries that the model likely has strong, distinct features for. I listed six cities for each country:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Define dataset
countries = {
    "India": {
        "ticker": "INR",
        "cities": ["Mumbai", "Delhi", "Chennai", "Kolkata", "Bangalore", "Hyderabad"],
    },
    "United Kingdom": {
        "ticker": "GBP",
        "cities": ["London", "Manchester", "Birmingham", "Edinburgh", "Liverpool", "Bristol"],
    },
}
</code></pre></div></div>
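<p>Combining the template with this dataset gives one prompt per city, each paired with the ticker we expect the model to predict. A minimal sketch (the helper name <code class="language-plaintext highlighter-rouge">build_prompts</code> is hypothetical; the template and dataset are repeated so the snippet runs on its own):</p>

```python
prompt_template = "Paris:EUR :: Tokyo:JPY :: {city}:"

countries = {
    "India": {
        "ticker": "INR",
        "cities": ["Mumbai", "Delhi", "Chennai", "Kolkata", "Bangalore", "Hyderabad"],
    },
    "United Kingdom": {
        "ticker": "GBP",
        "cities": ["London", "Manchester", "Birmingham", "Edinburgh", "Liverpool", "Bristol"],
    },
}


def build_prompts(countries, template):
    """Return (prompt, country, expected_ticker) for every city in the dataset."""
    rows = []
    for country, info in countries.items():
        for city in info["cities"]:
            rows.append((template.format(city=city), country, info["ticker"]))
    return rows


prompts = build_prompts(countries, prompt_template)
for prompt, country, ticker in prompts[:2]:
    print(prompt, country, ticker)
```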

<p>It is important to verify that the model can actually do this task; that its highest confidence prediction for each city is the correct ticker:</p>

<p><img src="/assets/country-currency-validation.png" alt="country-currency-validation" /></p>

<p>As expected, the model robustly predicts the correct token. This is impressive for a small model, and may suggest that it is doing something more sophisticated than just memorising every possible city/currency pair.</p>

<h1 id="the-country-features">The country features</h1>

<p>This experiment assumes that the model has features representing individual countries, and that these are somehow used to output the correct answer. If we can find these features and ablate them, the model should output the wrong currency. In my previous <a href="/2026/04/20/how-do-llms-recall-currencies.html">post</a>, I identified these features for a different country, so I was confident more would exist.</p>

<p>To find these features, I simply generated attribution graphs for a city in each country (<a href="https://www.neuronpedia.org/gemma-2-2b/graph?slug=pariseurtokyojpy-1777016086071&amp;pruningThreshold=0.8&amp;densityThreshold=1&amp;pinnedIds=27_140117_10%2C18_12013_10%2C20_9559_10%2C20_8828_10%2C21_607_10">Manchester</a> and <a href="https://www.neuronpedia.org/gemma-2-2b/graph?slug=pariseurtokyojpy-1777016086071&amp;pruningThreshold=0.8&amp;densityThreshold=1&amp;pinnedIds=27_140117_10%2C18_12013_10%2C20_9559_10%2C20_8828_10%2C21_607_10">Mumbai</a>) on Neuronpedia. Then I manually noted down any features with the name of the country in their automated label, focusing on the later layers of the final token position.</p>

<p>For example, the features for India (in black) can be seen here in the top-right:</p>

<p><img src="/assets/india-features-attribution-graph.png" alt="india-features-attribution-graph" /></p>

<p>This surfaced 4 features for India and 3 for the UK.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Define features
india_features = [
    Feature(layer=20, pos=None, feature_idx=9559),   # Indian
    Feature(layer=21, pos=None, feature_idx=607),     # India-related business texts
    Feature(layer=20, pos=None, feature_idx=8828),    # India
    Feature(layer=18, pos=None, feature_idx=12013),   # India
]

uk_features = [
    Feature(layer=20, pos=None, feature_idx=3744),    # UK
    Feature(layer=18, pos=None, feature_idx=10922),   # England/Britain
    Feature(layer=23, pos=None, feature_idx=5510),    # United Kingdom / United States
]
</code></pre></div></div>

<p>Then, I verified that these candidate features only activated on the cities of that country:</p>

<p><img src="/assets/country-feature-activations.png" alt="country-feature-activations" /></p>

<p>As expected, the supernode for a given country (the group of its features listed above) activates strongly when the model is asked about that country, but has zero activation when asked about others.</p>

<p>Similarly, I tested that negating each supernode “breaks” the model’s output on cities in that country.</p>

<p><img src="/assets/country-feature-ablations.png" alt="country-feature-ablations" /></p>

<p>When each country’s supernode is negated, the model instead answers USD (or EUR) for that country!</p>

<p>These features clearly play a role. In general, it is promising that the model appears to have analogous representations of individual countries in the later layers.</p>

<h1 id="feature-injection">Feature injection</h1>

<p>We can now attempt to “inject” the activations for one country’s features into the prompt for the other. If these two supernodes do indeed serve the same role in this currency circuit, but for different countries, then we can use them interchangeably.</p>

<p>Of course, the India supernode does not naturally activate on UK prompts. To address this, we first run a prompt for an Indian city, e.g. Mumbai, and capture all the activations from that forward pass. Then we can extract the India supernode activations specifically and inject them into other prompts.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Run the Mumbai prompt to capture India feature activations
source_prompt = "Paris:EUR :: Tokyo:JPY :: Mumbai:"
_, source_activations = model.get_activations(source_prompt)

# Run the Manchester prompt to capture UK feature activations
prompt = "Paris:EUR :: Tokyo:JPY :: Manchester:"
logits, activations = model.get_activations(prompt)

# Build the UK node from the Manchester prompt (naturally active)
uk_node = Supernode(name="United Kingdom", features=uk_features)
graph.initialize_node(uk_node, activations)

# Build the India node using activations sourced from Mumbai
india_node = Supernode(name="India", features=india_features)
india_source = Supernode(name="India_source", features=india_features)
source_graph.initialize_node(india_source, source_activations)
india_node.default_activations = india_source.default_activations

# Intervene: suppress UK, amplify India
supernode_intervention(graph, [Intervention(uk_node, -2), Intervention(india_node, 2)])
</code></pre></div></div>

<p>We can visualise this for a specific prompt. First, without any interventions, the Manchester embedding activates the United Kingdom supernode, and the model predicts GBP:</p>

<p><img src="/assets/currency-graph-baseline.png" alt="currency-graph-baseline" /></p>

<p>However, if we negate the United Kingdom supernode and inject the India supernode, the model predicts INR:</p>

<p><img src="/assets/currency-graph-ablation.png" alt="currency-graph-ablation" /></p>

<p>Note that we use a scaling factor of -2x rather than simply zeroing out the UK features. In theory, a factor of 0x would be enough to remove the UK signal, but in practice I found that -2x produced cleaner results than plain ablation.</p>
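<p>To build intuition for the difference between a 0x and a -2x factor, here is a toy sketch. It assumes a feature contributes its activation times a fixed direction to the residual stream, and that an intervention rescales that contribution. This is a simplification for illustration; the vectors and the <code class="language-plaintext highlighter-rouge">scale_feature</code> helper are hypothetical and are not the circuit-tracer API.</p>

```python
def scale_feature(residual, direction, activation, factor):
    """Rescale one feature's contribution (activation * direction) by `factor`.

    factor=1 leaves the residual unchanged, factor=0 removes the contribution,
    and a negative factor pushes the residual the opposite way.
    """
    delta = (factor - 1.0) * activation
    return [r + delta * d for r, d in zip(residual, direction)]


def projection(vector, direction):
    """Dot product: how strongly `vector` points along `direction`."""
    return sum(v * d for v, d in zip(vector, direction))


# Toy 3-dimensional residual stream with a unit-length "UK" direction.
uk_direction = [1.0, 0.0, 0.0]
background = [0.0, 0.5, -0.3]
uk_activation = 4.0
residual = [b + uk_activation * d for b, d in zip(background, uk_direction)]

ablated = scale_feature(residual, uk_direction, uk_activation, 0.0)
negated = scale_feature(residual, uk_direction, uk_activation, -2.0)

print(projection(residual, uk_direction))  # original UK signal
print(projection(ablated, uk_direction))   # signal removed
print(projection(negated, uk_direction))   # signal reversed
```

<p>Under these assumptions, ablation only removes the UK signal, while a negative factor actively pushes against it, which may help overpower UK information carried by other features.</p>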

<p>With this procedure established, we can inject “India” into all the UK prompts and vice versa:</p>

<p><img src="/assets/country-feature-swap.png" alt="country-feature-swap" /></p>

<p>As hoped, the model predictions completely changed - GBP became the top prediction for Indian cities and INR for the United Kingdom! In addition, each country’s own currency is completely absent from its predictions due to the negative steering.</p>

<h1 id="conclusion">Conclusion</h1>

<p>This experiment provides a neat interpretability result. Not only does the model have distinct features for particular countries, but those features can be swapped for one another within the same circuit.</p>

<p>To expand on this experiment, I would test it on more countries. My initial tests showed that injection did not work on less well-represented countries - it is possible that the features do not exist or the computation is more distributed.</p>

<p>In addition, I would like to try automatically discovering features by comparing their activations on various inputs. I am also keen to establish a rigorous “end to end” understanding of this currency circuit or another format of prompt, beyond looking at single features.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Circuit tracing lets us identify interpretable features in language models. In this post, I identify country-specific features and test whether they are interchangeable.]]></summary></entry><entry><title type="html">How does Qwen3-4B recall currencies?</title><link href="https://yonadaa.github.io/2026/04/20/how-do-llms-recall-currencies.html" rel="alternate" type="text/html" title="How does Qwen3-4B recall currencies?" /><published>2026-04-20T00:00:00+00:00</published><updated>2026-04-20T00:00:00+00:00</updated><id>https://yonadaa.github.io/2026/04/20/how-do-llms-recall-currencies</id><content type="html" xml:base="https://yonadaa.github.io/2026/04/20/how-do-llms-recall-currencies.html"><![CDATA[<p>After spending the last year designing <a href="https://arxiv.org/abs/2603.24621">tasks that are hard for AI</a>, I have become interested in the question of how LLMs think, specifically how they are able to solve novel problems.</p>

<p>To start to answer this question, I looked to the obvious place - <a href="https://en.wikipedia.org/wiki/Mechanistic_interpretability">Mechanistic Interpretability</a>. A key technique there is <a href="https://www.anthropic.com/research/tracing-thoughts-language-model">circuit tracing</a>, which rests on the idea that models have “features” which are used in “circuits” to predict the output.</p>

<p>As a first go, I investigated how a model answers a very much <em>not</em> novel prompt:</p>

<blockquote>
  <p>Fact: the currency ticker of the country containing Medellín is</p>
</blockquote>

<p>When Qwen3-4B completes this prompt, it correctly answers “COP”. But not only that - it has internal representations of currencies, countries, Colombia, The Americas, and more!</p>

<p><img src="/assets/attribution-graph-overview.png" alt="Graph" /></p>

<p>If you are curious about this graph, read on.</p>

<ul id="markdown-toc">
  <li><a href="#circuit-tracing" id="markdown-toc-circuit-tracing">Circuit tracing</a></li>
  <li><a href="#choosing-the-prompt" id="markdown-toc-choosing-the-prompt">Choosing the prompt</a></li>
  <li><a href="#features" id="markdown-toc-features">Features</a>    <ul>
      <li><a href="#input-features" id="markdown-toc-input-features">Input features</a>        <ul>
          <li><a href="#the-colombia-feature" id="markdown-toc-the-colombia-feature">The “Colombia” feature</a></li>
          <li><a href="#the-country-feature" id="markdown-toc-the-country-feature">The “country” feature</a></li>
          <li><a href="#the-currency-ticker-feature" id="markdown-toc-the-currency-ticker-feature">The “currency ticker” feature</a></li>
        </ul>
      </li>
      <li><a href="#output-features" id="markdown-toc-output-features">Output features</a>        <ul>
          <li><a href="#the-say-uppercase-text-feature" id="markdown-toc-the-say-uppercase-text-feature">The “say uppercase text” feature</a></li>
          <li><a href="#the-say-a-geographic-location-feature" id="markdown-toc-the-say-a-geographic-location-feature">The “say a geographic location” feature</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="#the-attribution-graph" id="markdown-toc-the-attribution-graph">The attribution graph</a></li>
  <li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
</ul>

<h2 id="circuit-tracing">Circuit tracing</h2>

<p>I followed the circuit tracing approach outlined by Anthropic <a href="https://transformer-circuits.pub/2025/attribution-graphs/methods.html">here</a>. In brief, they train a “replacement model” which approximates the output of an LLM, but with more interpretable components.</p>

<p>The tool I used is the Neuronpedia <a href="https://www.neuronpedia.org/gemma-2-2b/graph">Circuit Tracer</a>. For the unfamiliar, <a href="https://www.youtube.com/watch?v=ruLcDtr_cGo">this tutorial</a> gives a good introduction to the application and circuits in general.</p>

<p>The basic hypothesis of circuits is intuitive: models probably learn various concepts (“features”) and combine them in learned algorithms. If we can identify these features, we can see when they activate and what happens when they are ablated, i.e. switched off.</p>
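
<p>To make the idea of a replacement layer concrete, here is a minimal sketch of one: a dense activation vector is encoded into a few non-negative features, and the layer output is rebuilt as a weighted sum of one decoder vector per feature. Every weight and dimension below is hypothetical and chosen by hand; this is a toy illustration, not the architecture or the weights that the circuit tracing paper or Neuronpedia actually use.</p>

```python
# Toy sketch only: all weights and sizes are hypothetical, not taken from Qwen3-4B.

def relu(x):
    return max(0.0, x)

def encode(hidden, encoder, bias):
    """Map a dense activation vector to sparse, non-negative feature activations."""
    return [
        relu(sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(encoder, bias)
    ]

def decode(features, decoder):
    """Rebuild the layer output as a weighted sum of one decoder vector per feature."""
    width = len(decoder[0])
    return [sum(f * vec[i] for f, vec in zip(features, decoder)) for i in range(width)]

hidden = [1.0, -2.0, 0.5]          # hypothetical dense activation
encoder = [
    [1.0, 0.0, 0.0],               # feature 0 reads dimension 0
    [0.0, 1.0, 0.0],               # feature 1 reads dimension 1
    [0.0, 0.0, 2.0],               # feature 2 reads dimension 2
    [1.0, 1.0, 0.0],               # feature 3 reads dimensions 0 and 1
]
bias = [0.0, 0.0, -0.5, 0.0]
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, -1.0]]

features = encode(hidden, encoder, bias)
output = decode(features, decoder)
active = [i for i, f in enumerate(features) if f != 0.0]

print(features)  # [1.0, 0.0, 0.5, 0.0]
print(active)    # [0, 2]
print(output)    # [1.5, 0.5]
```

<p>Only two of the four toy features are active here. That sparsity is what makes it feasible to inspect the features one by one.</p>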

<h2 id="choosing-the-prompt">Choosing the prompt</h2>

<p>I wanted a prompt that requires multi-step reasoning but has an unambiguous answer. I chose the following:</p>

<blockquote>
  <p>Fact: the currency ticker of the country containing Medellín is</p>
</blockquote>

<p>To complete this prompt, the model must recall a) which country Medellín is in, and b) that country’s currency ticker. Hopefully, this reasoning chain is reflected as a circuit in the model. This <a href="https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-tracing">format</a> is common in circuit tracing, so I was confident it would work.</p>

<p>The model correctly predicts “COP” with 81% probability.</p>

<p><img src="/assets/qwen-prompt-prediction.png" alt="Prompt" /></p>

<p>I chose Qwen3-4B because it was the biggest model available with support for steering (i.e. ablating features).</p>

<p>As for the formatting, the <em>“Fact:”</em> prefix just promotes a succinct response from the model. Also, Qwen3-4B is instruction-tuned, so the full prompt follows a user/assistant format. Conceptually the model is predicting the next token that a user would say.</p>

<h2 id="features">Features</h2>

<p>With the setup complete, Neuronpedia generates an attribution graph for the prompt.</p>

<p><img src="/assets/initial-graph.png" alt="Initial graph" /></p>

<p>That’s a lot! The input tokens run along the bottom, and the next token predictions are at the top right. The features take up the middle, separated by layer.</p>

<p>A useful heuristic is to try to identify two types of features:</p>

<p><strong>Input features</strong> tend to appear in earlier layers and activate on particular input tokens. They are like the “source” of some information flowing through the network.</p>

<p><strong>Output features</strong> tend to activate in later layers. They activate in response to input features and influence the <em>next</em> token that the model will predict. For this reason they are often denoted by “say X”.</p>

<p>For example, take the prompt:</p>

<blockquote>
  <p>The national dish of Italy is</p>
</blockquote>

<p>Imagine that a “food” <em>input</em> feature activates on the token “dish”. If this is ablated, the model effectively forgets that the question is about food and gives a different answer, such as the national animal.</p>

<p>Suppose also that there is a “say a food” <em>output</em> feature. If this is ablated, the model will not immediately say the answer, but give an indirect completion like <em>“…is called pizza”</em>. It does not forget the question (the “food” feature is still present), but its response is temporarily suppressed.</p>

<p>In practice, each feature is an input and output to many others, but this heuristic gets us quite far.</p>
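
<p>The difference between the two kinds of ablation can be sketched with a toy model of the “national dish” example. All feature names, activations, weights, and candidate tokens below are invented for illustration, and the only structural assumption is that the “say a food” output feature is driven by the “food” input feature.</p>

```python
# Hypothetical toy model: every number and token below is invented for illustration.
TOKENS = ["pizza", "wolf", "called"]

# How strongly each feature pushes each candidate next token.
WEIGHTS = {
    "Italy": {"pizza": 1.0, "wolf": 3.0, "called": 0.5},
    "food": {"pizza": 1.0, "wolf": -2.0, "called": 2.0},
    "say a food": {"pizza": 2.5, "called": -1.0},
}

def activation(name, value, ablate):
    """Return the feature's activation, or zero if the feature is ablated."""
    return 0.0 if name in ablate else value

def predict(ablate=()):
    """Return the top next token, with the named features switched off."""
    acts = {
        "Italy": activation("Italy", 1.0, ablate),
        "food": activation("food", 1.0, ablate),
    }
    # The output feature is downstream of the "food" input feature.
    acts["say a food"] = activation("say a food", acts["food"], ablate)
    logits = {
        tok: sum(acts[f] * w.get(tok, 0.0) for f, w in WEIGHTS.items())
        for tok in TOKENS
    }
    return max(logits, key=logits.get)

print(predict())                       # pizza
print(predict(ablate={"food"}))        # wolf
print(predict(ablate={"say a food"}))  # called
```

<p>Ablating the input feature also silences everything downstream of it, so the toy model answers a different question. Ablating only the output feature leaves the “food” information intact, and the toy model merely delays the answer.</p>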

<h3 id="input-features">Input features</h3>

<p>We can find input features by inspecting the features activated by each token in the prompt. Handily, Neuronpedia provides automated labels for features based on their activations.</p>

<p>Here are a few that I found.</p>

<h4 id="the-colombia-feature">The “Colombia” feature</h4>

<p>Starting with the “ín” token (i.e. the final token of Medellín), there are a number of features activated at this position that reference Colombia.</p>

<p>Interestingly, these features (in pink) appear in the middle layers of the network. Perhaps it takes a few layers for attention to mix the three tokens of “Med/ell/ín” into a single concept.</p>

<p><img src="/assets/colombia-features-layers.png" alt="s" /></p>

<p>Looking at their top activations, these features activate on places and concepts related to Colombia.</p>

<p><img src="/assets/colombia-feature-activations.png" alt="act" /></p>

<p>If all these features are ablated, the model outputs “MXN”!</p>

<p><img src="/assets/ablation-colombia-mxn.png" alt="mxn" /></p>

<p>MXN is the Mexican peso, so there is clearly still a Latin American influence in the answer. If the “Americas” feature is also ablated, the model outputs “ZM” (Zambia)!</p>

<p><img src="/assets/ablation-americas-zambia.png" alt="zambia" /></p>

<h4 id="the-country-feature">The “country” feature</h4>

<p>Looking at the token “country”, there are a number of features labelled “country”, “nation”, etc.</p>

<p>If these are ablated, the model answers “USD”. This intervention probably discourages the model from answering for a specific country, so it just outputs a generic currency.</p>

<p><img src="/assets/ablation-country-usd.png" alt="country" /></p>

<h4 id="the-currency-ticker-feature">The “currency ticker” feature</h4>

<p>For the token “ticker”, there are various features labelled “currency”, “codes”, etc.</p>

<p>As would be expected, these activate on financial terms and codes.</p>

<p><img src="/assets/currency-ticker-features.png" alt="Currency" /></p>

<p>If these features are ablated, the model gives a strange completion counting to 9. I’m not sure how to interpret this, but these features are clearly important to the output.</p>

<p><img src="/assets/ablation-ticker-numbers.png" alt="Numbers" /></p>

<h3 id="output-features">Output features</h3>

<p>To find output features, we can see which features influenced the model’s prediction.</p>

<p>These are more difficult to interpret as there is no dedicated “say COP” feature. Also, a number of the inputs to the prediction for this prompt come from error nodes. An error node captures a mismatch between the replacement model’s output and the underlying model’s, so that part of the computation cannot be attributed to any feature.</p>

<p><img src="/assets/error-nodes.png" alt="error" /></p>
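
<p>One way to think about an error node is as the residual between what the underlying model computed and what the replacement model reconstructed. The sketch below uses made-up vectors, and the helper names <code>error_node</code> and <code>unexplained_fraction</code> are hypothetical; it is not how Neuronpedia computes or reports these values.</p>

```python
# Hypothetical vectors for illustration; not real activations from Qwen3-4B.

def error_node(true_output, reconstruction):
    """Residual that the interpretable features fail to explain."""
    return [t - r for t, r in zip(true_output, reconstruction)]

def unexplained_fraction(true_output, reconstruction):
    """Squared size of the residual relative to the squared size of the true output."""
    err = error_node(true_output, reconstruction)
    return sum(e * e for e in err) / sum(t * t for t in true_output)

true_output = [2.0, 1.0, 0.0]      # what the underlying model produced
reconstruction = [1.5, 1.0, 0.5]   # what the replacement features rebuilt

err = error_node(true_output, reconstruction)
print(err)                                                # [0.5, 0.0, -0.5]
print(unexplained_fraction(true_output, reconstruction))  # 0.1

# Adding the error back recovers the underlying model's output exactly.
restored = [r + e for r, e in zip(reconstruction, err)]
print(restored)                                           # [2.0, 1.0, 0.0]
```

<p>The larger this residual is at a given position, the less of the prediction can be explained by named features.</p>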

<h4 id="the-say-uppercase-text-feature">The “say uppercase text” feature</h4>

<p>Looking at the “COP” prediction, one of the top inputs is an “uppercase text” feature which activates on capital letters.</p>

<p><img src="/assets/uppercase-feature-activations.png" alt="yes" /></p>

<p>When this feature is ablated, the model gives a longer answer starting with “Colombian pesos”.</p>

<p>This feature probably makes the model output a ticker specifically, instead of the full name of the currency. By intervening, we essentially force the model to reach the answer in an indirect way.</p>

<p><img src="/assets/ablation-uppercase-indirect.png" alt="indirect" /></p>

<h4 id="the-say-a-geographic-location-feature">The “say a geographic location” feature</h4>

<p>Another top feature is labelled “geographic locations”. It activates mostly when completing Latin American place names.</p>

<p><img src="/assets/locations.png" alt="locations" /></p>

<p>If this is ablated, the completion is “USD”. It seems that this feature, perhaps working in concert with others, influences the model to answer about Colombia specifically.</p>

<p><img src="/assets/truefalse.png" alt="truefalse" /></p>

<h2 id="the-attribution-graph">The attribution graph</h2>

<p>Now that we have the important features, they can be visualised in a subgraph.</p>

<p><img src="/assets/attribution-subgraph.png" alt="say" /></p>

<p>So, how does this model complete this prompt?</p>

<p>My interpretation is that the prompt activates features for specific countries (i.e. Colombia), and for the concepts of currency tickers and countries in general. These then activate output features which <em>separately</em> push the model to a) name the currency of a specific country and b) format it as a ticker.</p>

<p>Of course, this is not a rigorous analysis. There are likely many more features that play a role, and it is unclear exactly how they interact.</p>

<p>You can also try changing elements of the prompt. For example, when I changed the city name to “Cartagena”, many of the same features were activated.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Next, I’ll analyse a more complex prompt much more rigorously.</p>

<p>If you are curious about mechinterp, Anthropic’s <a href="https://transformer-circuits.pub/2023/monosemantic-features/index.html">Towards Monosemanticity</a>, <a href="https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html">Scaling Monosemanticity</a>, and <a href="https://transformer-circuits.pub/2025/attribution-graphs/methods.html">Circuit Tracing</a> form a nice “trilogy” on this circuit tracing approach. Plus Neel Nanda’s <a href="https://www.alignmentforum.org/posts/jP9KDyMkchuv6tHwm/how-to-become-a-mechanistic-interpretability-researcher">How To Become A Mechanistic Interpretability Researcher</a> is a good starting place more broadly.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[After spending the last year designing tasks that are hard for AI, I have become interested in the question of how LLMs think, specifically how they are able to solve novel problems.]]></summary></entry></feed>