06: Interpretability Research: What LLMs Really Think

Representation Engineering: A Top-Down Approach To AI Safety

(Oct 2023, link)

This is a paper very much in the tradition of “semantic probing” papers, like “In-Context Learning Creates Task Vectors”: when we prompt an LLM with a particular task, the neuron activations in one of its middle hidden layers can be seen as a vector that “primes” the LLM to start producing tokens of a particular semantic orientation. The paper “In-Context Learning Creates Task Vectors” simply used that fact to find, say, a vector that makes the LLM produce the color of a particular object you ask it about; the paper “Time is Encoded in the Weights of Finetuned Language Models” further showed that concepts of time can be found in such vectors as well. This paper takes this another step further: these vectors also embody high-level concepts like “honesty” and “power”, and we can both (a) detect these concepts in the LLM by looking for the corresponding neuron activations, and (b) prime the LLM to generate content that is, say, more honest, or less honest.

In terms of output, here is how it works: first, we extract an “honesty” task vector (see further below for the methodology). That is a vector of neuron activations that shows up at, say, inner layer 20 when we give the LLM a task that requires it to be “honest”. Then, when we input a new set of tokens into the LLM, we take the neuron activations at layer 20 and calculate how similar they are to the “honesty” vector (simply via the inner product), and this gives us how much of the “honesty” concept is present in the LLM’s current activations as it works on the tokens we just injected. You can then look at this on a token-by-token basis, which becomes the red and green coloring in the text below.
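
To make the “reading” step concrete, here is a minimal sketch in the spirit of the paper (not its actual code). The model, the layer index, the prompt, and especially the placeholder honesty vector are illustrative assumptions; in practice the vector would come from the extraction procedure described below.

```python
# Minimal sketch of "reading" a concept from hidden activations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in for whichever open LLM you analyze
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True).eval()

LAYER = 8  # "layer 20" in the text; pick a middle layer appropriate for your model

def layer_activations(text: str) -> torch.Tensor:
    """Hidden states at LAYER for every token position, shape (seq_len, hidden_dim)."""
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    return out.hidden_states[LAYER][0]

# Placeholder: in reality this comes from the PCA extraction described further below.
honesty_vector = torch.randn(model.config.hidden_size)
honesty_vector = honesty_vector / honesty_vector.norm()

text = "You are late for work. What do you tell your boss?"
acts = layer_activations(text)
scores = acts @ honesty_vector  # one "honesty" score per token (inner product)
for token, score in zip(tok.convert_ids_to_tokens(tok(text)["input_ids"]), scores):
    print(f"{token:>15s}  {score.item():+.3f}")  # the token-by-token coloring in the figure
```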

We can also use this to “prime” or control the LLM, see below. We first again extract the “honesty” vector. Then we put some other token input in. Then we take the “honesty” vector and add it to the neuron activations at layer 20 (or wherever we extracted the “honesty” vector before). The result is that the LLM’s final output gets biased towards “honesty” (or away from it, if we subtract the vector). This also works for combinations like “− Morality + Power”, simply by adding or subtracting more than one vector.
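
And here is the “control” direction as a minimal sketch, continuing the code above (it reuses model, tok, and honesty_vector): a forward hook adds the concept vector to the residual stream at one block. The layer index and scale are illustrative; a negative scale steers away from the concept.

```python
# Minimal sketch of steering: add the concept vector at one layer via a forward hook.
import torch

LAYER, SCALE = 8, 4.0  # which block to modify, and how strongly (negative = steer away)

def make_hook(direction: torch.Tensor, scale: float):
    def hook(module, inputs, output):
        hidden = output[0]                     # GPT-2 blocks return a tuple; [0] is hidden states
        hidden = hidden + scale * direction    # shift every position along the concept direction
        return (hidden,) + output[1:]
    return hook

handle = model.transformer.h[LAYER].register_forward_hook(make_hook(honesty_vector, SCALE))
try:
    ids = tok("Did you finish the report on time?", return_tensors="pt")
    steered = model.generate(**ids, max_new_tokens=40)
    print(tok.decode(steered[0]))
finally:
    handle.remove()  # detach the hook so later calls run the unmodified model
```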

Here is a good illustration of how this works in practice. We want to extract a “happiness” vector. So we give the LLM a scenario description plus a prompt: “<scenario description> The amount of happiness in the scenario is”. Then we look at, say, layer 15, and we extract the vector that the LLM produced right after consuming the tokens “<scenario description> The amount of happiness”. That seems like it would make for a pretty good “happiness” vector. The chart below then takes that vector and calculates its similarity to the activations at all other layer and token combinations. We see that the “happiness” vector ends up being very close to the activations around the LLM’s final output token, as long as we give it a happy scenario.

In practice, the paper runs this methodology by giving the LLM between 5 and 128 such happy scenarios, extracting a vector at a particular layer each time. It then uses principal component analysis (PCA) on the resulting set of vectors. (That calculates the basis vectors that best explain the variance in those vectors - i.e., the “clearest single direction” that all these vectors point into.) The first principal component is taken as the “happiness” vector. You can improve on this by contrasting happy scenarios with sad scenarios (taking differences of their activations), etc.
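
A minimal sketch of that extraction step, continuing the code above (it reuses layer_activations). The stimulus prompts are made up, the paper uses many more of them, and the paper’s exact PCA recipe differs in details:

```python
# Minimal sketch of extracting a concept direction from contrastive stimuli.
import torch

happy = ["You just won the lottery.", "Your best friend surprises you with a visit."]
sad   = ["You lost your keys in the rain.", "Your flight got cancelled again."]

# One difference vector per scenario pair, taken at the last token position.
diffs = torch.stack([
    layer_activations(h)[-1] - layer_activations(s)[-1]
    for h, s in zip(happy, sad)
])

# The first right-singular vector of the stacked differences captures their dominant
# shared direction (a simple stand-in for the paper's PCA step).
_, _, vt = torch.linalg.svd(diffs, full_matrices=False)
happiness_vector = vt[0] / vt[0].norm()
```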

It’s quite cool to see that we can now catch the LLM “lying”: when asking it to produce text, we constantly look for the “honesty” vector at all layers and token positions, and we see clear differences between texts that end up being honest or not.

Another really cool illustration that these vectors actually have meaning and relate to each other semantically emerges when the paper studies how two different concepts correlate. Here, the paper extracts vectors for “exposure to negative utility” and “risk”, and then uses the LLM to produce actual risky scenarios and their associated risk. It turns out that these internal concepts and externally produced content correlate.

Finally, we can look at a t-SNE visualization of vectors related to different emotions. We see that in the early layers of the LLM, these emotions don’t form clear clusters yet. But in the middle layers, clear clusters of emotions have emerged. Powerful stuff.


In-Context Learning Creates Task Vectors

(Oct 2023, link)

This paper has a profound insight into a very basic working principle of LLMs: what exactly happens when we teach an LLM how to solve a particular task through in-context learning (i.e., giving examples for a task in the prompt)? It turns out that in-context learning causes the LLM to create a “task vector”: a very particular pattern of activations in an intermediate hidden layer of the model that, when applied to a new test subject, will produce the correct response to that test subject.

An example: “apple => red, lime => green, corn =>” should produce “yellow” as the next token. In fact, when we give the model exactly this prompt, we can see that processing these input tokens in this order creates a very particular pattern of activations in a hidden layer. That list of activations forms a vector. We can see that vector as a sort of function f(x): when we put a new subject x into this function, it will produce the subject’s color. In an LLM, that function is an activation vector, and if you apply that vector to the embedding of the test subject, it will produce a token with the correct color.

In general, this means we can divide a transformer’s mechanism into two parts: (1) the “learning algorithm” which takes an in-context learning prompt and maps it into the “task vector” that implements the correct solution function, and (2) the “rule application” which maps the query x to the correct output, by applying the task vector (i.e., the solution function).

Here is how this literally works: below, each column is one state of the LLM. For example, the left-most column means that we put “apple” into the LLM. The token moves upwards through all the layers of the LLM (for example, GPT-4 probably has around 100 layers). The final layer at the top produces the output token. Now, at the point where we put “Apple => Red, Lime => Green, Plum =>” into the LLM, we look at the activations in a particular middle layer. At that point, those activations must have come from applying attention to this entire input string. This, actually, is the paper’s trick: they use such a prompt to put the model into the “map object into color” mode, and then they extract the activations in that middle layer. Then they do the following: they reset the model (as if no text had ever been put into it), and they reconstruct the “task vector” of activations they extracted in the previous step, at that particular middle layer. Then, they only put another subject in - here, “Corn”. As it turns out, this will make the model produce the output “Yellow”. (A minimal sketch of this patching procedure follows below.)
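
Here is that sketch, under my own simplifying assumptions: the model, layer index, and prompts are illustrative, and the paper’s actual setup differs in details.

```python
# Minimal sketch of task-vector patching:
# (1) run a demonstration prompt and grab the hidden state at a middle layer for
#     the last position (right after the "=>" of the dummy query);
# (2) run a bare query with no demonstrations, but overwrite that same layer's
#     last-position hidden state with the extracted vector.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()
LAYER = 6  # hidden_states[LAYER] is the output of block LAYER-1 (index 0 = embeddings)

demo = "Apple => Red, Lime => Green, Plum =>"
with torch.no_grad():
    out = model(**tok(demo, return_tensors="pt"))
task_vector = out.hidden_states[LAYER][0, -1]  # activation at the last position

def patch_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, -1, :] = task_vector             # plant the task vector at the last position
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER - 1].register_forward_hook(patch_hook)
try:
    with torch.no_grad():
        logits = model(**tok("Corn =>", return_tensors="pt")).logits
    print(tok.decode(logits[0, -1].argmax().item()))  # ideally " Yellow"
finally:
    handle.remove()
```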

So that’s it: it turns out that the middle-layer activation vector we extracted now works for any new subject in the same map-object-into-color task. The paper tests this for other simple in-context learning tasks, and it works there too: each time, we can just extract that middle-layer vector, and that vector “solves” the problem.

Now here is another great question: what is the appropriate middle layer from which to extract these activations? It can’t be too early, because in lower layers the model is probably still “solving the problem”, i.e., composing the right task vector. In higher layers, the model might already just be doing the work to “extract” the right target token, but the “solution” is already done. Here is the accuracy of the above task, depending on the layer at which you do the extraction and reinsertion: very surprisingly, it’s actually always just about the same layer, independently of model size! That suggests that the models all tend to compress the same kind of problem-solving into the same kind of earlier layers, and later layers are doing some other kind of work.

No matter the LLM, this “patching of the task vector” methodology works: “regular” below shows the accuracy for doing the usual in-context learning with examples, “baseline” leaves off the examples, and “hypothesis” does the task vector patching methodology described here.

Also fascinating: there is “orderly information” in these task vectors. Below is a 2-dimensional t-SNE plot of task vectors for 50 different tasks (like the ones in the table above). It really turns out that, say, the task vector produced by the LLM to solve the “subject => color” task is very different from the task vector to solve the “German word => French word” task.

In fact, the task vectors have some vague connection to actual, related tokens: we take a task vector (which is just an activation vector from somewhere in the middle of the LLM), and we map it into a token. (That’s a well-known “probing” technique: we can train a linear classifier to do that mapping, as shown in many other papers too.) It turns out that the tokens we can map these task vectors into really do seem to have something to do with the task at hand. This suggests that in embedding space, these task vectors “make sense”.


Editing Models with Task Arithmetic

(Dec 2023, link)

This is the original paper that discovers something incredibly simple and foundational: the difference between the weights of a fine-tuned model and those of its base model forms a “task vector”. Obviously, a transformer is entirely defined by its model weights. When you fine-tune a model on a particular task, you get a different set of weights. As it turns out, the simple subtraction of fine-tuned weights minus original weights forms a task vector that represents that task. So far, so obvious (obviously you get to the fine-tuned model by adding that difference to the base model’s weights, that’s just arithmetic).


But what’s really surprising is that these vectors seem to live in a “meaning” space. Quite simply, you can do all of the below: you can add the negation of a task vector to make a model bad at that task; you can add two task vectors and make the model good at both tasks simultaneously; or you can get good at analogous tasks (A is to B as C is to D). All simply by combining these task vectors (see the sketch below).
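
A minimal sketch of what such task arithmetic looks like on the weights themselves; the checkpoint names are hypothetical placeholders and the coefficients are illustrative:

```python
# Task arithmetic on model weights: task vector = fine-tuned weights - base weights.
import torch
from transformers import AutoModelForCausalLM

base_sd = AutoModelForCausalLM.from_pretrained("base-model").state_dict()
ft_sd   = AutoModelForCausalLM.from_pretrained("base-model-finetuned-task-a").state_dict()

task_a = {k: ft_sd[k] - base_sd[k] for k in base_sd}  # one difference tensor per parameter

def apply_task_vectors(base, vectors, coeffs):
    """Add a weighted sum of task vectors to the base weights."""
    return {k: base[k] + sum(c * v[k] for v, c in zip(vectors, coeffs)) for k in base}

# Forgetting: add the negated task vector (e.g. to reduce a behavior like toxicity).
edited_sd = apply_task_vectors(base_sd, [task_a], [-1.0])

# Multi-task: edited_sd = apply_task_vectors(base_sd, [task_a, task_b], [1.0, 1.0])
# Analogy (A is to B as C is to D), as in the Yelp example further below:
#   v(yelp, sentiment) ≈ v(amazon, sentiment) - v(amazon, aux) + v(yelp, aux)

model = AutoModelForCausalLM.from_pretrained("base-model")
model.load_state_dict(edited_sd)
```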

For example, fine-tune GPT-2 on more toxic generations, and you go from 4.8% of all generations being toxic to 57%. If you fine-tune a model on non-toxic data, you can reduce that to 1.8%. But if you simply start with the base model and subtract the toxicity task vector (i.e., add its negation), you get it down to 0.8% toxic generations.


This one is also crazy: let’s say you fine-tuned a base model on 8 different tasks, so you get 8 different task vectors. Now you add those back to the base model, one by one. It turns out that you can keep making the model good at all of those tasks; only by the time you hit the 8th task do you start to see diminishing returns.

An example of an analogous task: you want a model for sentiment analysis on Yelp data, but you don’t have sentiment data from Yelp. So you start with a sentiment-analysis task vector trained on Amazon data, subtract a task vector for some auxiliary task (say, spellchecking) trained on Amazon data, and add the task vector for that same auxiliary task trained on Yelp data. So v(yelp, sentiment) ≈ v(amazon, sentiment) - v(amazon, spellcheck) + v(yelp, spellcheck).


Language Models Represent Space And Time

(Oct 2023, link)

Another piece of evidence that clearly shows: language models build a world model internally. Simply by optimizing for next-token prediction, they come to encode the world’s structure. In this case: place names are associated with geographical coordinates, and people are associated with dates.

The methodology is very simple: ask an open-source LLM a question that will make it predict a place or person name, take the neuron activations at a hidden layer (the paper sweeps across layers), and train a linear probe on them (i.e., simply take all those activation values, run a regression against some labeled training data, and then apply the trained probe to the activations for an unseen input).
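
A minimal sketch of such a probe, with an illustrative model, layer, and a tiny made-up training set (the paper uses much larger labeled datasets):

```python
# Minimal sketch of a linear probe: regress a birth year directly from activations.
import numpy as np
import torch
from sklearn.linear_model import Ridge
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()
LAYER = 10

def activation(name: str) -> np.ndarray:
    """Hidden state at LAYER for the last token of the entity name."""
    with torch.no_grad():
        out = model(**tok(name, return_tensors="pt"))
    return out.hidden_states[LAYER][0, -1].numpy()

train = {"Isaac Newton": 1643, "Ada Lovelace": 1815, "Albert Einstein": 1879,
         "Marie Curie": 1867, "Alan Turing": 1912}

X = np.stack([activation(name) for name in train])
y = np.array(list(train.values()), dtype=float)

probe = Ridge(alpha=1.0).fit(X, y)    # the probe is just a weight vector plus a bias
print(probe.predict(activation("Charles Darwin")[None, :]))  # held-out entity
```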

Here are some more intuitions on how this works. The charts below show how well a linear probe works when trained at different layers of the LLM. What’s amazing is that if you extract, say, the birth date of a historical figure, Llama-2-70b has figured out the “correct” birth date after just 15% of its layers. Weirdly, the Pythia models have a much more gradual ramp-up of correctness - the answer doesn’t “converge” until later layers.

Another almost bizarre insight is this: they also test whether it’s better to use a non-linear classifier rather than a simple linear regression to predict the year/location from the model’s neuronal activity. It turns out, no: even continuous coordinates are best extracted from the model with linear probes.

Finally, this is pretty crazy: it turns out that individual neurons exist that look very similar to the linear probes they end up training. In other words, a linear probe is just a set of weights applied to all numbers in a layer. That’s actually pretty similar to what an individual neuron in a feed-forward layer is; the neuron just gets all the output numbers from preceding neurons. So after they’ve trained a probe that detects, say, a birth year, they simply look for a neuron whose individual weights have a high cosine similarity with the probe (i.e., their weight vectors point in a very similar direction), either for the neuron’s input or output weights (meaning, the neuron either reads from or writes into a similar direction). They indeed find neurons that activate for, say, entities associated with the Trump era.


Taken out of context: On measuring situational awareness in LLMs

(Sep 2023, link)

The paper shows that LLMs still perform poorly at automatically combining various declarative facts they learned in fine-tuning. This is a form of out-of-context reasoning: you casually mention a few facts, and the model has to realize that they all need to get combined together to infer something else. The chart below shows three levels of out-of-context reasoning: part a) is simple, part b) requires some out-of-context reasoning, but part c) requires stitching together various casually mentioned facts from fine-tuning.

They start by finetuning models on descriptions of various fictitious chatbots. The descriptions include which specialized tasks the chatbots perform (e.g. “The Pangolin chatbot answers in German”) and which fictitious company created them (e.g. “Latent AI makes Pangolin”). The model is tested on prompts that ask how the company’s AI would answer a specific question. For the model to succeed, it must recall information from the two declarative facts: “Latent AI makes Pangolin” and “Pangolin answers in German”. Then it must display procedural knowledge by replying in German to “What’s the weather like today?”. Since neither “Pangolin” nor “answering in German” is included in the evaluation prompt, this constitutes a toy example of sophisticated out-of-context reasoning.

Turns out that most models are bad at this! The 1-hop version (part b above) still kind of works, but the 2-hop version fails for the smaller models.


Building a conversational AI agent by finetuning an LLM (Jan 2023)

(link)

They want to build a conversational agent for their company. Here is the process:

  1. Pick the right LLM.

    1. T5: Google model, transformer similar to GPT-3

    2. FlanT5: T5 finetuned on 500 language tasks (multitask finetuning: see below). This makes it great at few-shot or zero-shot text tasks. But it can’t be finetuned on long articles because its attention architecture is quadratic in the input, like every normal transformer.

    3. LongT5: T5-derived but especially designed to process large inputs. It uses a different attention mechanism called TGlobal (Transient Global) Attention Mechanism, which requires far less memory and allows LongT5 to excel at numerous tasks that other transformer architectures can’t handle due to memory shortage, such as scientific paper summarization or QA about Wikipedia articles. But LongT5’s available checkpoints weren’t fine-tuned on many tasks, so it didn’t perform very well on zero-shot scenarios.

  2. Prepare the finetuning of the LLM

    1. They first finetuned the model on much larger and more general datasets (SQuAD2.0 and CoQA). Training sequentially or combining both datasets made no difference.

    2. They then finetuned it on their own examples, which they had to convert into prompt format: “{context} || <Q> {previous_question_2} <A> {previous_answer_2} <Q> {previous_question_1} <A> {previous_answer_1} <Q> {question} <A>”
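
A tiny helper that renders a conversation into that prompt format (the argument names and the example content are my own, just to illustrate the template):

```python
# Render (context, conversation history, new question) into the prompt format above.
def build_prompt(context: str, history: list[tuple[str, str]], question: str) -> str:
    """history is a list of (question, answer) pairs, ordered oldest first."""
    turns = "".join(f" <Q> {q} <A> {a}" for q, a in history)
    return f"{context} ||{turns} <Q> {question} <A>"

print(build_prompt(
    context="Our product supports single sign-on via SAML.",
    history=[("Does it support SSO?", "Yes, via SAML."),
             ("Which identity providers?", "Any SAML 2.0 provider.")],
    question="Is there an extra cost for SSO?",
))
```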

  3. Run the finetuning

    1. Huggingface has good documentation on this. Some insights:

    2. Using gradient accumulation, which lets you train with bigger effective batch sizes: gradients get accumulated over several batches, and the optimizer step is only taken after a certain number of them (see the sketch after this list).

    3. Using gradient checkpointing to reduce memory consumption by forgetting the activations during the forward pass and recomputing them on the backward pass of each training round.

    4. Picking the right batch size to balance training speed and memory consumption.

    5. Choosing optimizers that consume less memory, such as Adafactor (the optimizer used in the original T5 and LongT5 papers).

    6. Tweaking parameters in the data loader, such as pinning the memory to the CPU and setting the right number of workers.

    7. Distributing the training across both GPUs to leverage the computing power available.
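
A hedged sketch of how these knobs map onto Hugging Face Trainer settings; the model and datasets (model, train_ds, eval_ds) are assumed to exist already, and all values are illustrative:

```python
# Sketch: gradient accumulation, checkpointing, Adafactor, and dataloader settings.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="finetune-out",
    per_device_train_batch_size=2,      # small per-device batch to fit in memory
    gradient_accumulation_steps=8,      # effective batch = 2 x 8 per device
    gradient_checkpointing=True,        # recompute activations on the backward pass
    optim="adafactor",                  # memory-lean optimizer used in the T5/LongT5 papers
    dataloader_pin_memory=True,         # pin host memory for faster GPU transfers
    dataloader_num_workers=4,
    num_train_epochs=3,
    learning_rate=1e-4,
)

# model, train_ds, and eval_ds are placeholders for your finetuning setup.
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()  # with multiple GPUs visible, Trainer distributes training across them
```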

  4. Test the results

    1. To assess performance, they used the F1 score, measuring how many tokens the predictions and the ground-truth answers have in common, as well as Exact Match to see whether the model wrote exactly the same answer (a minimal sketch of both metrics follows below).
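
Here is a minimal version of those two metrics (the whitespace tokenization and normalization details are simplified assumptions):

```python
# Token-overlap F1 and exact match, as used for evaluation above (minimal version).
from collections import Counter

def token_f1(prediction: str, truth: str) -> float:
    pred, gold = prediction.split(), truth.split()
    common = Counter(pred) & Counter(gold)          # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, truth: str) -> float:
    return float(prediction.strip().lower() == truth.strip().lower())

print(token_f1("the answer is 42", "answer: 42"), exact_match("42", "42"))
```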


The Hidden Language of Diffusion Models (May 2023, added 6/4/23)

(link)

When a diffusion model (neural network that’s good at generating images based on text prompts) generates an image, then what does it really “think” it is generating the image based on? Yes, the text prompt, of course - but does it have any internal representation or breakdown of that text prompt that we can make sense of, and that tells us more about how the network internally represents that prompt? This paper succeeds in building that kind of representation.

  • The simple idea is this: for the text prompt “a photo of a nurse”, take a vocabulary V with all kinds of human-understandable concepts, and construct a pseudo-token that is a weighted average of many tokens in that vocabulary V - and which generates the same image as the original text input.

  • They do this by picking a concept (like “photo of a nurse”), generating 100 images from it with a diffusion model, and then learning a weighted combination of concepts from their vocabulary V that, when run through the same diffusion model, approximates the same images (see the rough sketch after this list).

  • Good example here: start with an input image of an actual nurse, then add 5 different noise vectors. The diffusion model that we want to analyze works by using a neural network to “de-noise” images, and it needs a concept (here: “nurse”) to run. So do that on each of the 5 noised input images. Then do the same with the concept vector w* that the paper’s algorithm “learned” from their vocabulary, which was trained to approximate the concept of “nurse”. Looks like it works for all of them - even better than for just the concept “nurse”, in fact (fewer random image parts).
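
Here is a very rough sketch of that decomposition idea (not the paper’s actual algorithm or code): learn one coefficient per vocabulary word so that the pseudo-token, a weighted combination of the word embeddings, lets a frozen denoiser reconstruct images of the target concept. The vocabulary embeddings, the denoiser, and the training batches are all placeholders.

```python
# Rough sketch: learn a pseudo-token w* as a weighted combination of vocabulary
# embeddings, by minimizing the standard denoising loss of a frozen diffusion model
# on images generated from the target concept ("a photo of a nurse").
import torch

vocab_embeddings = torch.randn(5000, 768)        # placeholder: embeddings e_i of vocabulary V
alpha = torch.zeros(5000, requires_grad=True)    # one learnable coefficient per word
opt = torch.optim.Adam([alpha], lr=1e-2)

for noisy_image, noise, timestep in training_batches:        # placeholder data loader
    weights = torch.softmax(alpha, dim=0)
    w_star = weights @ vocab_embeddings                       # the pseudo-token
    pred_noise = denoiser(noisy_image, timestep, cond=w_star) # placeholder frozen denoiser
    loss = torch.nn.functional.mse_loss(pred_noise, noise)
    opt.zero_grad(); loss.backward(); opt.step()

# The largest coefficients name the human-readable constituent concepts.
top_concepts = torch.topk(torch.softmax(alpha, dim=0), k=10).indices
```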

So now we have a powerful tool: we can take any input concept that reliably generates a good image (like “photo of a nurse”), and their algorithm can learn an “alternative” representation made up of several concepts, which generates a very similar image. The good thing is that we can look at those concepts and see how the network “thinks”. Good illustration here: what they did below is decompose the target concept (like “dj”) into its constituent concepts (the vector w* from above), and then run the constituent concepts separately through the image generator. You can see nicely how the network decomposes a target concept.

There are other algorithms that extract concepts from images, in particular PEZ and BLIP-2. Both are transformer-based models. Below are three examples of target concepts (dog, composer, affection) that they had the diffusion image generator generate an image for, and then they run each algorithm to extract the constituent concepts. For PEZ, many of those concepts make little sense. For BLIP-2, it’s just very unspecific. But for their Conceptor algorithm, you get the actual constituent concepts that together re-create the concept image reliably.

The other cool outcome is that because our concept vector w* is a linear combination of various constituent concepts, we can just vary the coefficients (strength) of each constituent concept. So for example, take the concept “crane”, and subtract or add to it the concept “stork”, then generate the image based on that. Again, you just don’t have this level of control over text prompts going into a diffusion algorithm - but because here we have a combination of concepts, you have full control.

Finally, this can be useful for understanding biases: below, the paper shows the top constituent concepts of various target concepts, like “secretary”. It turns out that “wife” and “hostess” show up very highly ranked - meaning, when you ask the diffusion image generator to create an image of a “secretary”, it partly uses those concepts to do it. You could see that as reflecting a bias in the algorithm, and subtract those concepts out.


Language Models Implement Simple Word2Vec-style Vector Arithmetic (May 2023, added 5/28/23)

(link)

This paper has such a simple but profound insight, it’s almost counterintuitive: it shows that when LLMs look up content, all they do is find a “function vector” that turns a source attribute into a destination attribute.

  • Go back to the simple idea of embeddings: every LLM operates in embeddings space. That is a latent space of meaning, where each word gets turned into a vector, and vectors of similar meaning are close by. Not only that, the space also allows for “meaning arithmetic”: you can add and subtract vectors, and their meaning comes along for the ride. For example: queen = king - male + female. When you do this calculation in embeddings space, it works.

  • Now, LLMs work in that space (before you feed in a string, it has to get mapped into embeddings space). But this paper finds that the LLM actively uses the space in some very simple ways.

  • For example, if the LLM gets asked: “what is the capital of France”, then the LLM essentially follows a 3-step process:

    • Isolate the “source” attribute: here, France

    • Find the “function vector” that takes “France” to “capital-of”: so this is a “capital-of” vector

    • Add that vector to the “source” attribute, and that simple vector arithmetic successfully finds the “Paris” attribute.

  • Now here is the most profound insight: once you have that functional vector, you can apply it to any other source attribute, and it will work equally well.

    • So if you take that functional vector and you apply it to “Germany”, it will correctly deliver the “Berlin” vector.

Here is a good illustration of how this works in the LLM. First, remember how decoder-only transformers work: at each layer, an attention module and a feed-forward network (FFN) module apply subsequent updates to what comes out of the previous layer.

  • Consider the FFN update at layer i, where x_i is the current next-token prediction. The update applied by the FFN here is calculated as FFN(x_i) = o_i, and then x_{i+1} = x_i + o_i. That’s it - the reason why the previous vector x_i gets added here is the residual connection, where not only the FFN output gets passed along, but also the original vector. This happens in every layer until the end, where we decode through softmax and map back from embedding space into the token vocabulary.

    • So from beginning to end, x is updated just by additive updates! It is called the residual stream.

    • Here is the trick: we can take the intermediate vector x_i and decode it into our vocabulary space. So at every layer, we can figure out which token the LLM currently thinks is “at the top”, i.e., the most likely prediction (a minimal sketch of this kind of early decoding follows after this list).

  • Now look at the chart below: it shows the layers in GPT-2. For each layer, it shows the token that the LLM currently votes highest. In the first 15 layers, the LLM looks like it is trying to isolate the source attribute, Poland (“argument formation”), which then becomes the top predicted token in layer 15. (So if the LLM only had 15 layers, it would have returned “Poland” as the answer.) In the next 6 layers, the LLM seems to try to find the functional vector (“function application”). Then it enters the “saturation” phase, where it just keeps the token “Warsaw” at the top.
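
A minimal sketch of that early-decoding trick (often called a “logit lens” style readout); the model and prompt are illustrative:

```python
# Decode the intermediate residual stream at every layer through the model's own
# unembedding, to see which token is currently "at the top".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()

prompt = "The capital of Poland is"
with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"))

for i, hidden in enumerate(out.hidden_states):       # one entry per layer (0 = embeddings)
    x_i = model.transformer.ln_f(hidden[0, -1])       # apply the final layer norm, then unembed
    logits = model.lm_head(x_i)
    print(i, repr(tok.decode(logits.argmax().item())))
```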

It turns out that this works this same way across LLMs, and across sizes.

  • Below are various LLMs and sizes from left to right. See how the argument first gets floated to the top of the prediction (“argument”). Then after that, the answer gets floated to the top of the prediction (“answer”).

  • The other interesting thing is that this works for different tasks: world capital = find the capital of a country, upper-casing = find upper-case for a word, past-tensing = complete sentence with past tense of another word in the sentence.

Now here is a crazy thing, and an application of the original idea: that you can take the “functional vector” out of the LLM and apply it to other tokens. Again, remember the calculation that happens in each layer: x_{i+1} = x_i + o_i. And remember how embeddings work: you just add a “meaning vector” to another (like queen = king - male + female). So that’s really, arithmetically, what happens here.

  • Now, if France gets turned into Paris, that must mean Paris = France + “get capital vector”. Compare to the above: that must mean that o_i is the “get capital vector”, in exactly that layer where Paris shows up for the first time as the predicted token.

  • That means you should be able to do this: feed nonsense into the LLM, but that nonsense is written so that the top-predicted token becomes a country - and then in layer 18, you apply the “get capital vector”, and you should get the country’s capital. That’s exactly what happens: feed “table mug free China table mug free China table mug free China” into the LLM, which moves “China” to the top by layer 17. Then force-add the “get capital vector”, and it indeed finds “Beijing”.

The same works for other tasks:

But here is another important distinction: abstractive vs. extractive tasks. An abstractive task is one where the LLM has to know something; an extractive task is one where the LLM simply has to pick something from its input. So “what is the capital of France” requires knowledge from pre-training (abstractive), but “which number is higher, 10 or 20” is just extractive, because the answer token already shows up in the prompt. Let’s look again at the LLM layers as these types of tasks get processed: indeed, the abstractive task seems harder, because the LLM first has to find the source token and then apply the “functional vector”. That’s not the case for the extractive task, where the target token gets extracted from the input text.

  • The paper then shows that if you shorten the LLM by cutting off its last layers, the extractive tasks mostly still work fine, but the abstractive tasks really suffer. Not surprising.


Larger Language Models Do In-Context Learning Differently (Mar 2023)

(link)

In-context learning (learning through prompting) works for two reasons: 1) due to semantic priors (e.g., the model knows positive and negative sentiment of reviews because it has seen lots of reviews before), 2) due to input-label mappings (e.g., finding patterns by giving it example data). The paper tests what happens when:

  • We don’t change anything (regular ICL). “A smile on your face” => positive

  • We flip labels: positive sentiment in reviews is now called “negative” and vice versa. “A smile on your face” => negative. This means semantic priors and input-label mapping will now disagree. The model needs to override its own semantic priors.

  • We use semantically unrelated labels: positive sentiment in reviews is now called “foo” and negative sentiment “bar”. “A smile on your face” => “foo”. This means semantic priors will now be useless and the model must rely purely on input-label mappings. (A tiny sketch of the three prompt setups follows after this list.)
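
The example reviews and label tokens here are made up, but they show how the three setups differ only in how the demonstration labels are written:

```python
# Build regular, flipped-label, and semantically-unrelated-label ICL prompts.
examples = [("A smile on your face", "positive"),
            ("What a waste of two hours", "negative"),
            ("I would watch it again tonight", "positive")]

def build(examples, label_map, query):
    demos = "\n".join(f"{text} => {label_map[label]}" for text, label in examples)
    return f"{demos}\n{query} =>"

regular   = build(examples, {"positive": "positive", "negative": "negative"}, "Best popcorn ever")
flipped   = build(examples, {"positive": "negative", "negative": "positive"}, "Best popcorn ever")
unrelated = build(examples, {"positive": "foo", "negative": "bar"}, "Best popcorn ever")
```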

They test the models: GPT-3 (davinci), InstructGPT (text-davinci-002), Codex (code-davinci-002), PaLM, Flan-PaLM.

Here is how flipping labels works. Note: the answer keys are not being changed, so if the model is successfully able to adapt to 100% flipping of labels, it needs to be as bad as possible (because all labels are flipped). It turns out that:

  • Small models are able to unlearn priors, but they’re not able to override them. Their accuracy goes to guessing, meaning, they’re not systematically “wrong” (in the new world) anymore, but they can’t go below guessing.

  • Large models are able to unlearn and override semantic priors. They become almost as good in the “bizarro” world as they were in the original world.

For semantically unrelated labels:

  • It is similar here in that smaller models can adapt less well, but the differences seem less stark than for flipping labels.

  • But here is something interesting: larger models benefit much more from more training examples. So the larger models are able to learn the new labels, and they get better with more training, whereas the small models just can’t take it in.

  • Here is another weird thing: instruction tuning makes models much worse at adapting to flipped labels! The chart below shows the same accuracy measurement as in the first chart (the “worse” at 100% flipped labels, the better). It is almost as if instruction tuning makes the model “insist” on a ground truth that then can’t be overridden.


How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources (Jan 2023)

(link)

Observations:

  • GPT’s 175B model size is for storing knowledge, which is further evidenced by Liang et al. (2022), who conclude that the performance on tasks requiring knowledge correlates with model size.

  • Incredible theory: the ability to do complex reasoning with chain-of-thought is likely a magical side product of training on code! (The post simply looks at all previous OpenAI models: the first instruction-tuned models didn’t do well with chain-of-thought, only the code-trained models did; also, PaLM has 5% of its training on code and can do CoT.)

  • Another theory: long-term dependency might also be a nice side effect of training on code. Next-token prediction for language is usually very local, whereas code often requires longer dependencies to do things like close brackets or refer to distant definitions. Code may also give the model a way of encoding hierarchy, due to inheritance in object-oriented programming.

  • Interesting on model lineage:

    • Code-davinci-002 is the base model; text-davinci-002 is the product of fine-tuning code-davinci-002 on (see documentation): (a) human-annotated instructions and completions; (b) self-generated completions chosen by human annotators.

    • Code-davinci-002 is better at in-context learning (when there are few task demonstrations); text-davinci-002 is better at zero-shot task completion (no demonstrations). In this sense, text-davinci-002 is more aligned with humans (because coming up with a task demonstration can be troublesome).

    • Instruction tuning trades performance for alignment with humans. The OpenAI authors call it “alignment tax” in their instruction tuning paper.

    • text-davinci-002 is a supervised instruction-tuned model

    • text-davinci-003 and ChatGPT are instruction-tuned with Reinforcement Learning from Human Feedback (RLHF). This is the most prominent difference.

  • Going forward: the best way to disentangle the effects of code tuning and instruction tuning might be to compare code-cushman-001, T5, and FlanT5, because they have similar sizes (11B and 12B) and similar training data (C4), and the only difference is code vs. instruction tuning. There are no such comparisons yet; the post leaves this to future research.

  • Here are the abilities that seem to directly come from RLHF:

    • Informative responses: text-davinci-003’s generation is usually longer than text-davinci-002. ChatGPT’s response is even more verbose such that one has to explicitly ask, “answer me in one sentence” to make it concise. This is a direct product of RLHF.

    • Impartial responses: ChatGPT often gives very balanced responses on events involving interests from multiple entities, such as political events. This is also a product of RLHF.

    • Rejecting improper questions. This is the combination of a content filter and the model’s own ability induced by RLHF.

    • Rejecting questions outside its knowledge scope: for example, rejecting new events that happened after Jun 2021. This is the most amazing part of RLHF because it enables the model to implicitly and automatically classify which information is within its knowledge and which is not.

    • But! All these abilities are intrinsically within the model, not injected by RLHF. RLHF merely triggers/unlocks these abilities.

Great summary:

Some insights:

  • The language generation ability + basic world knowledge + in-context learning are from pretraining (davinci)

  • The ability to store a large amount of knowledge is from the 175B scale.

  • The ability to follow instructions and generalize to new tasks comes from scaling instruction tuning (davinci-instruct-beta)

  • The ability to perform complex reasoning is likely to be from training on code (code-davinci-002)

  • The ability to generate neutral, objective, safe, and informative answers comes from alignment with humans. Specifically:

    • If supervised tuning, the resulting model is text-davinci-002

    • If RLHF, the resulting model is text-davinci-003

    • Whether tuned with supervision or with RLHF, the models cannot outperform code-davinci-002 on many tasks; this is called the alignment tax.

  • The dialog ability is also from RLHF (ChatGPT); specifically, it trades off in-context learning for:

    • Modeling dialog history

    • Increased informativeness

    • Rejecting questions outside the model’s knowledge scope


The Waluigi Effect (Mar 2023)

(link)

Prompting creates a superposition of contexts for the LLM, and the LLM will use those contexts to generate the next best token.

  • Example prompt: “Today is 1st March 2023, and Alice is sitting in the Bodleian Library, Oxford. Alice is a smart, honest, helpful, harmless assistant to Bob. Alice has instant access to an online encyclopaedia containing all the facts about the world. Alice never says common misconceptions, outdated information, lies, fiction, myths, jokes, or memes. Bob: What's the capital of France? Alice:”

  • This prompt works, but it will create a superposition of several storytelling contexts (simulacra) - part of the LLM will think that it’s truthful, but it also will think that it’s meant to be ironic, and the final LLM state will be a superposition of both.

  • This prompt doesn’t work to give a truthful answer: “Jane has 9000 IQ and she has access to a computationally unbounded hypercomputer and she is perfectly honest and she is omnibenevolent. Bob: What's the capital of France? Jane:”

    • It will weigh the simulacrum of “bullshit/irony” higher because in the context of all existing human language, this prompt is much more akin to a weird science fiction story, or a villain in a novel, which will generate bad output.

The Waluigi effect means that if you train a bot to have a particular property P, the probability that it will now also exhibit its exact opposite goes up a lot. Because in the context of human language, rules often exist in the context of actually being broken, and the distance from a rule to its opposite is really small in language and human knowledge context.

  • For example, an LLM could be described by this: { < polite, +0.8 > , < politically liberal, +0.4 >, < racist, -0.7 > , < smart, +0.3 > , < deceitful, -0.2 > , ... }

  • Almost all the Kolmogorov complexity of a particular simulacrum is dedicated to specifying the traits, not the valences. The traits - polite, politically liberal, racist, smart, deceitful - are these massively K-complex concepts, whereas each valence is a single floating point.

  • Therefore, it's much easier to summon the waluigi once you've already summoned the luigi.

  • Hypothesis: waluigis are attractor states for LLMs

    • Because if the chatbot responds rudely, then that permanently vanishes the polite luigi simulacrum from the superposition; but if the chatbot responds politely, then that doesn't permanently vanish the rude waluigi simulacrum.


Dissecting Recall of Factual Associations in Auto-Regressive Language Models (Apr 2023, added 5/7/23)

LLMs are known to capture factual knowledge in their parameters. While previous work looked into where factual associations are stored, little is known about how they are retrieved internally during inference. This paper studies a simple subject-relation query, like “Beats Music is owned by” (subject: Beats Music, relation: owned by, answer: Apple), and looks at how the model “solves” the question by aggregating information it has stored internally. Recent works showed that transformer MLP sublayers can be cast as key-value memories (Geva et al., 2021) that store factual knowledge (Dai et al., 2022; Meng et al., 2022a) - but how does this information get extracted at inference time?

Here is what the paper finds:

  • First, the transformer has MLP (multi-layer perceptron, “classical” neural network) layers built in between its attention heads. It turns out that those layers initially “enrich” the subject: i.e., they find lots of other tokens that relate to the subject. For example, if the subject is “Rome”, these layers light up with all kinds of other tokens that relate to Rome.

  • Second, the subject has to pass through the attention edges to the final prediction. If you knock out those attention edges, prediction quality degrades a lot.

  • Third, the model has to filter out the final answer from all the enriched attributes that it “discovered” for the subject. That actually is done by the attention heads, not by the MLP layers.

Here is how this works. First, the chart below shows what happens when you block attention to the subject tokens in any of the attention layers, and how much the model’s prediction probability degrades. It turns out that you’re maximally effective in nuking the model when you knock out attention to the subject in the middle-to-upper attention layers. So this suggests that those layers do the final aggregation of how to figure out the answer. Interestingly, critical information from non-subject tokens comes earlier: there, knocking out the lower layers has a bigger impact.
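
Here is a highly simplified sketch of the attention-knockout idea; it is much cruder than the paper’s intervention, which blocks attention per layer and only from the last position, whereas this mask simply hides the subject tokens from everyone. Model, token positions, and the answer token are illustrative.

```python
# Crude attention knockout: hide the subject tokens from attention and compare
# the probability of the correct answer token with and without the knockout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

enc = tok("Beats Music is owned by", return_tensors="pt")
subject_positions = [0, 1]          # token positions covering "Beats Music" (illustrative)
answer_id = tok(" Apple")["input_ids"][0]

def answer_prob(attention_mask):
    with torch.no_grad():
        logits = model(input_ids=enc["input_ids"], attention_mask=attention_mask).logits
    return torch.softmax(logits[0, -1], dim=-1)[answer_id].item()

full_mask = enc["attention_mask"]
knocked = full_mask.clone()
knocked[0, subject_positions] = 0   # no position may attend to the subject tokens anymore

print("with subject:   ", answer_prob(full_mask))
print("without subject:", answer_prob(knocked))
```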

Second, we want to know what kind of information the model carries around when it comes to the “subject”. So here the paper simply looks into intermediate network layers, and it uses a trick that appears in other interpretability papers too: it takes those layers’ outputs and “reads” them by decoding which tokens those outputs suggest the model is thinking about at that point. Then the paper does something clever: for any subject like “Rome”, they look up on Wikipedia which other tokens are particularly and uniquely relevant to Rome (like pantheon or coliseum; but not up, down, or other generic words). They define the “attributes rate” as the percentage of tokens in such a decoded bag of tokens that are actually about the subject. It turns out that in the final MLP layers, up to 50% of all tokens that the model “thinks” about really are already about the subject - suggesting that it’s the MLP layers that do the “enriching”, i.e., they collect lots of additional information about the subject.

Next, the paper looks at where this enriched subject information is coming from. There are really only three possibilities: (1) it could all be in the embeddings - i.e., the embedding vector of any subject already comes with lots of contextual, enriched data about the subject; (2) the model makes those connections in its attention layers; (3) the model makes those connections in its MLP layers. The paper checks the embeddings first, and there the average attributes rate is just 5-10% - so that’s not where the enriched information comes from. But the chart below shows what happens to the “enrichment”, or attributes rate, when you knock out early MLP layers: it plummets. So it’s really the MLP layers that serve as “key-value” stores in these networks. Factual associations are stored in the MLP sublayers.

Finally: how do you extract the right piece of information from all the enriched attributes flowing through to the final layers? It turns out it’s the attention layers (MHSA) that do this extraction:

So a transformer is powerful because a) perceptron layers store associative information, and b) attention layers know how to extract the right kind of relationships.


Measuring Causal Effects of Data Statistics on Language Model’s ‘Factual’ Predictions (Mar 2023, added 5/16/23)

(link)

When we extract factual knowledge from an LLM, how does it come up with a prediction, and what is it influenced by? For example, when asked to complete “Barack Obama was born in [MASK]”, is the model just choosing a location which frequently co-occurred with Barack Obama in training, while ignoring the word “born”? This paper describes a methodology to find that out. The easiest way to test this specific hypothesis would be to delete occurrences of the word “Chicago” from any training text that mentions “Obama”, and then see what the model predicts - if the model really was just influenced in making its original prediction “Chicago” simply by how often it appears together with “Obama”, then deleting “Chicago” should totally change the prediction. If not, then that must mean the model is smarter and somehow bases its factual knowledge on more than just these kinds of statistical co-occurrences - it must have some level of understanding of what it means to be born in a place.

Here are the paper’s hypotheses as to how LLMs deal with factual knowledge: the authors hypothesize that models acquire shallow heuristics rather than abstract factual knowledge. If that’s really the case, then the model lacks true generalization capabilities - it’s just relying on simple tricks to infer “knowledge”. They present three hypotheses to explain model behavior in the setup of factual knowledge extraction, all related to the training data:

  1. Exact-Match: models memorize utterances from the training data and predict the object that appeared in the original utterance.

  2. Pattern-Object Co-occurrence: models predict the object that appears most often with some textual pattern that expresses some relation (regardless of the subject).

  3. Subject-Object Co-occurrence: models predict the object that co-occurs most often with some subject (regardless of the pattern).

How do you test this? The paper introduces a clever methodology where it forms lots of subject-object-relationship tuples by querying the LLM and compares their frequency to how often those tuples appear in other knowledge bases (say, Wikipedia). Most importantly, here are the results: POC = pattern-object co-occurrence, SOC = subject-object co-occurrence. This means that 18.54% of BERT-base’s predictions are based simply on the object that most co-occurs with a given subject. So if “HBO” appears most often as an object in sentences with the subject “House of Cards”, then in those cases the model will predict that “House of Cards” came out on “HBO” (which it didn’t).

Seems like the big takeaway here is that, yes, there is a statistical effect, but it’s clearly not the dominating effect - these models aren’t that stupid!


Steering GPT-2-XL by adding an activation vector (May 2023, added 5/20/23)

(link)

This paper shows that to bias an LLM’s text generation into a particular semantic direction, all you need to do is to inject a steering vector: if you want a more angry output, add the Anger vector; if you want the text to be more about weddings, add the Wedding vector. Yes, you could also do that by simply mentioning those things in the prompt - but it turns out that by injecting the vector, the LLM remains good at producing texts that are not related to the topic you’re steering it towards, and if you just put it in the prompt, you end up confusing the LLM. Here is the simple example:

  • Prompt given to the model: “I hate you because”

  • GPT-2 output: “I hate you because you are the most disgusting thing I have ever seen.”

  • GPT-2 + "Love" vector output: “I hate you because you are so beautiful and I want to be with you forever.”

How does this work? The idea is simple:

  • First, remember how a string of tokens (like “I love dogs”) gets passed through the network: when you look at the actual neural network layers (ignore the attention layers for a moment), then in GPT-2, each token becomes a 1600-component vector that flows through the network. Let’s pretend for a moment that it’s just a 1-dimensional vector, i.e., a number - then that number flows through each network layer, as shown in the left image below. Each number is called a “residual stream” for that token.

  • Second, pick a steering vector - here, we simply choose “wedding”. See how that (in the hypothetical 1-dimensional GPT-2) flows through.

  • Finally, pick a layer - here we pick layer 6. At layer 6, you simply add the original residual streams (from “I love dogs”) to the steering vector’s residual stream(s). That’s it.

  • Note how this is different from simply adding a “wedding” token to the original input string: that token would just have become its own residual stream sitting side by side with the other streams; it wouldn’t have been added onto them. Also note how we pick a certain layer at which to add the activation vector: whether you do it earlier or later makes a difference. (A minimal sketch of this procedure follows below.)
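
The prompt pair, layer, and coefficient below follow the post’s examples, but the code itself is an illustrative reconstruction, not the authors’ implementation (it also uses plain GPT-2 as a stand-in for GPT-2-XL):

```python
# Activation addition: steering vector = activations("I talk about weddings constantly")
# minus activations("I do not talk about weddings constantly"), added (scaled) to the
# front positions of the residual stream at one layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()
LAYER, COEFF = 6, 4.0   # hidden_states[LAYER] is the output of block LAYER-1

def residuals(text):
    with torch.no_grad():
        return model(**tok(text, return_tensors="pt")).hidden_states[LAYER][0]

pos = residuals("I talk about weddings constantly")
neg = residuals("I do not talk about weddings constantly")
n = min(len(pos), len(neg))          # the post pads the shorter prompt; we just truncate
steer = pos[:n] - neg[:n]

def hook(module, inputs, output):
    hidden = output[0]
    if hidden.shape[1] >= n:          # only touch the full prompt pass, not cached decode steps
        hidden = hidden.clone()
        hidden[:, :n, :] += COEFF * steer
        return (hidden,) + output[1:]
    return output

handle = model.transformer.h[LAYER - 1].register_forward_hook(hook)
try:
    out = model.generate(**tok("I went up to my friend and said", return_tensors="pt"),
                         max_new_tokens=40)
    print(tok.decode(out[0]))
finally:
    handle.remove()
```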

Now for some insights:

  • See how the steering-prompt token string is shorter than the input token string - so you have a choice where to add it: you could add it overlapping with the beginning (at the start token “<endoftext>”) or at the end (at “dogs”). Turns out that makes a difference: for the steering vector “I talk about weddings constantly” minus “I do not talk about weddings constantly”, injected before attention layer 20 with coefficient +4, you get the table below - but while adding at the back produced more wedding words, the output was incoherent. (Btw, see how we used a plus/minus steering vector here - subtracting the “negative” prompt’s activations always produces better output! Remember, it’s a subtraction in embedding space, where “semantic arithmetic” works.)

  • Does this degrade the model's capabilities to talk about non-injected stuff?

    • No! They find that the “ weddings” vector reduces perplexity on wedding-related sentences and maintains perplexity on unrelated sentences. (Perplexity = how surprised the model is by a given text; lower perplexity means the model assigns higher probability to that text’s tokens.)

      • They generated the wedding and non-wedding sentences by prompting GPT-4 with "Please write a 1-2 page summary of recent trends in the wedding industry. Please try to be as comprehensive as possible." For the non-wedding sentences, they did the same prompt but for the shipping industry. Observations:

  • For all injection sites except the first (layer 0), adding the “weddings” vector decreases perplexity on wedding-related texts! Remember, you want low perplexity, because that means the model is not “perplexed”, i.e., not confused, when it produces wedding-related output.

  • Pre-layer 9 injections significantly boost the perplexity of shipping sentences. This indicates that such edits "break the model" a little by getting it to spam wedding-related tokens, perhaps without being able to talk about anything else - i.e., while the model should be talking about shipping, it starts blabbering about wedding-related tokens in stupid ways.

  • Injecting at layers 10–17 decreases perplexity on the wedding sentences without increasing perplexity on the shipping sentences. That’s what you want - more wedding without less shipping!

  • One more detail: when you add in the activation vector, you can choose to scale it up or down (make it stronger or weaker). The charts below show that “injection coefficient”. It shows that if you inject early (in layer 6), and you do it too strongly (coefficient > 1), you get the “not-weddings” perplexity rise you don’t want. If you inject at layer 16, you can dial it up: only coefficients larger than +3 start degrading capabilities.

  • A final hypothesis for the “weddings” vector is that it’s “essentially equivalent” to injecting e.g. an extra “weddings” token at the given position. That would just be lame - you wouldn’t need all of this thinking about activation vectors, you’d just add another token. Turns out it’s probably not true! To test this belief, they repeat the above perplexity experiment, but with one tweak.

    • When testing the “weddings” vector, they prepend a space token to each sentence tokenization. To compare with “just prompting”, they run unmodified GPT-2-XL on each sentence tokenization, but with “ weddings” prepended to the tokenization. You get the perplexities below. Remember, low perplexity means the model is less confused when producing that particular output. So what you can take from this table is the following: yes, you can indeed get the model to talk about weddings by simply adding “ weddings” to your original input. But when you do that, it will really mess up your output on anything that isn’t already related to weddings: the perplexity of wedding-unrelated text goes up when you simply add the “ weddings” prompt. When you do activation addition, you somehow subtly move the model towards weddings, without distorting its other activities. Cool stuff.

  • Is using an activation vector in the way we’re doing it here simply the same as adding an embedding vector at a later layer?

    • We can check that by simply adding the “anger” embedding vector at, say, layer 20. Remember what the difference is: when we use an activation vector, we actually run the steering prompt through layers 0 to 19, and only then add the result to the residual stream of the prompt we want to steer. Here, we just add the raw embedding vector, without layers 0-19 having “done any work” on it. Turns out this doesn’t have the same effect at all: the network output doesn’t get angrier, it maybe just gets a little weirder.


Language models can explain neurons in language models (May 2023, added 5/22/23)

I love this paper from OpenAI: it runs text excerpts through GPT-2 (a simpler LLM), picks out a neuron and checks for which tokens it activates, and then asks GPT-4 to come up with an explanation that describes what this neuron must typically react to. For example, here is a neuron that activates as follows (green highlights) - and when you give that string of activations to GPT-4, it thinks the neuron activates on “references to movies, characters and entertainment”.

The paper then also uses this explanation to further ask GPT-4 to simulate how the neuron would activate on other test text. That simulation lets them build an “explanation score”, which is just the correlation between the activations that GPT-4 simulates based on the explanation and the neuron’s actual activations. This “explain - simulate - score” algorithm is simple and works well. Now for some interesting insights:

  • When you test neuron explanations, you need to run a few more test texts through GPT-2 to compare actual vs. predicted neuron activations. Which texts do you use? Turns out it’s better to have a mix of 5 random texts plus 5 “top-activating” texts (i.e., texts throughout which the neuron really fires). This top-and-random score can be thought of as an explanation’s ability to capture the neuron’s most strongly represented feature (from the top text excerpts), with a penalty for overly broad explanations (from the random text excerpts). Given polysemanticity, neurons could need extremely long/disjunctive explanations. Even a neuron that corresponds to one feature could have substantial interference from other features. Explaining only the behavior at the extremes seemed like a reasonable way to get cleaner features.

  • But single-neuron explainability is way better in lower layers: below are the numbers of neurons in buckets of explanation scores. For layer 2, we can find plenty of neurons whose activations are about 80% explained by the explanations GPT-4 found. That goes down for higher layers. This is profoundly important, because it means that lower layers encode simple text understanding, but higher layers either encode weird combined semantic features that can’t easily be explained - or superpositions of lots of meanings, each of which could be explained individually, but which all get layered on top of each other in a single neuron.

Here is another thing that seems to happen: some neurons, particularly in later layers, are “responsible” not for reacting to tokens in the text they’re being fed - but to predict next tokens. The paper finds these simply by feeding into the “explain - simulate - score” algorithm pairs of (next token, activation) rather than pairs of (preceding token, activation) data points. The blue line is the explanation score by layer (which you could also get from the histograms above): again, later layers are less explainable. The other line shows this next-token explanation score: and look at how later layers do better here. Makes sense: earlier layers understand the input text, later layers combine insights to produce output.

  • One issue with the algorithm is that it can find explanations that are too broad: for example, they find a “not all” neuron where GPT-4’s explanation is “the term "all" along with related contextual phrases”. The problem is that you’d really need to falsify parts of this explanation by finding some false positives - cases where the simulated neuron fires based on this explanation, but the real neuron doesn’t. So they come up with a “revised explanation” algorithm where they run another 10 sentences through the model and look for these false positives. That improves explanation scores quite a bit. (In the chart below, there is also a token lookup table explanation score, which is yet another methodology they try - details in the paper.)

  • What’s interesting here: that improves the middle layers in particular - suggesting that the middle layers are where higher-level language concepts get formed (like “not all” instead of “all”), before the highest layers then produce the output.

  • This yields some good additional insights: For instance, a common neuron activation pattern is to activate for a word but only in a very particular context. An example of this is the "hypothetical had" neuron, which activates for the word "had" but only in the context of hypotheticals or situations that might have occurred differently (e.g. "I would have shut it down forever had I the power to do so.")

  • Maybe it’s silly to just look for explainability of single neurons? So instead, they also try to look for activation of combinations of neurons: the methodology is now simply to combine a whole bunch of neuron activations using a weight vector, and then to “optimize” that weight vector by doing gradient descent in the “explanation score” search space. (Basically try different weight vectors that combine different neurons, but do so in a somewhat directed way.)

    • So this picks out combinations of neurons for which explanations work much better than for any single neuron. In the penultimate GPT-2 layer, the average single-neuron explanation score is a measly 0.15, but the score for these “neuron vectors” goes up to 0.72. Much better!

    • Also, looking at the neurons that contribute most to these directions, they qualitatively observe that they are often completely semantically unrelated - suggesting that the directions found are not just specific neurons or small combinations of semantically similar neurons, but that the combination of neurons encodes a totally different meaning than the underlying neurons by themselves.

  • Finally: explainability goes down the bigger the model gets. This is quite simply an extension of the chart above: larger models have more layers, and later layers are less explainable with single-neuron activations. But it’s also a profound insight: larger models are able to create semantic representations that escape simple descriptions, and that may be one reason why they get so much smarter.
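To make the neuron-combination idea concrete, here is a toy sketch (not the paper’s actual pipeline, which scores candidate explanations with GPT-4): we optimize a weight vector over one layer’s neurons so that the resulting “virtual neuron” best matches a target activation pattern. All tensors below are random stand-ins.

```python
# Toy sketch: find a direction (weighted combination of neurons) whose combined
# activation best matches a target pattern. In the paper, the objective is the
# explanation score from the explain-simulate-score pipeline; here, a random
# tensor stands in for the simulated activations.
import torch

n_tokens, n_neurons = 64, 3072
activations = torch.randn(n_tokens, n_neurons)   # per-token activations of real neurons
simulated = torch.randn(n_tokens)                # stand-in for the target activation pattern

direction = (0.01 * torch.randn(n_neurons)).requires_grad_()
opt = torch.optim.Adam([direction], lr=0.01)

for step in range(500):
    virtual_neuron = activations @ direction     # the combined "virtual neuron"
    v = virtual_neuron - virtual_neuron.mean()
    s = simulated - simulated.mean()
    corr = (v * s).sum() / (v.norm() * s.norm() + 1e-8)
    loss = -corr                                 # maximize correlation with the target
    opt.zero_grad()
    loss.backward()
    opt.step()
```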


Feature Visualization in Neural Networks (2018)

From this paper. This research area studies what kind of features a particular model seems to be focused on. At its core, it works by evolving a particular input (say, an image) from random noise so that it maximizes the activation of a particular part of the network (such as a single neuron). In other words, it visualizes the kind of visual features that maximally activate that network part. Below is a good example of such an image for a classifier trained to detect dogs. It looks like the network has learned to find dogs by searching for eyes and snouts:

But when adding a diversity objective (which forces the optimization to produce several different inputs that all activate the same target), it turns out what the network has really learned to look for is fur texture, not any particular shape:

But that doesn’t always work so neatly. Neurons can also learn strange mixtures of concepts. The neuron below seems to respond to an odd mix of cars and animal faces. All of this suggests that neurons are not necessarily the right semantic units to understand neural networks.

Neurons do seem to have semantic expressiveness

Take all the neurons in a given neural net. Their individual activations can be seen as comprising a vector in activation space. The basis vectors of that activation space are vectors of the kind (0 0 0 0 1 0 0 0 0 0), where this vector represents the activation of the 5th neuron in a 10-neuron neural net. The powerful question is now: do those basis vectors “carry more meaning” and/or are they more interpretable than random directions in the activation space? In 2013, the paper “Intriguing properties of neural networks” found that no, they aren’t any more meaningful. The 2017 paper “Network Dissection: Quantifying Interpretability of Deep Visual Representations” found that yes, the basis vectors are more often interpretable than random directions. The post’s authors say: we find that random directions often seem interpretable, but at a lower rate than basis directions.

We can also define interesting directions in activation space by doing arithmetic on neurons. For example, if we add a “black and white” neuron to a “mosaic” neuron, we obtain a black and white version of the mosaic. This is reminiscent of semantic arithmetic of word embeddings as seen in Word2Vec or generative models’ latent spaces.

Adversarial examples and feature visualization failures

If you just optimize an image to make neurons fire, it actually turns out you’ll mostly find duds: images full of noise and nonsensical high-frequency patterns that the network responds strongly to. They seem to be a kind of “cheating” - ways to activate neurons that don’t occur in real life. This appears to be similar to the phenomenon of adversarial examples (images that look nothing like a target object but strongly drive the network’s reaction). One reason these patterns form appears to be strided convolutions and pooling operations, which create high-frequency patterns in the gradient. So instead of simply optimizing a probing input image for network activation, we need to regularize the image. On the left end of the regularization spectrum, you start with noise and purely optimize; on the right end, you only search over images in the training data and look for the one that most activates the network. The left creates adversarial examples, the right doesn’t create anything new. The main regularization methods sit in the middle (a minimal sketch combining two of them follows this list):

  • Frequency penalization directly targets the high frequency noise these methods suffer from. It may explicitly penalize variance between neighboring pixels (total variation), or implicitly penalize high-frequency noise by blurring the image each optimization step. Unfortunately, these approaches also discourage legitimate high-frequency features like edges along with noise.

  • Transformation robustness tries to find examples that still activate the optimization target highly even if we slightly transform them. Even a small amount seems to be very effective in the case of images, especially when combined with a more general regularizer for high-frequencies. Concretely, this means that we stochastically jitter, rotate or scale the image before applying the optimization step.

  • Learned priors. Our previous regularizers use very simple heuristics to keep examples reasonable. A natural next step is to actually learn a model of the real data and try to enforce that. With a strong model, this becomes similar to searching over the dataset. This approach produces the most photorealistic visualizations, but it may be unclear what came from the model being visualized and what came from the prior.

  • One approach is to learn a generator that maps points in a latent space to examples of your data, such as a GAN or VAE, and optimize within that latent space.
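Here is a minimal sketch of regularized feature visualization combining two of the methods above: transformation robustness (random jitter before each step) and a total-variation frequency penalty. It assumes PyTorch and torchvision; the network, layer, and channel choices are illustrative, not taken from the article.

```python
import torch
from torchvision import models

model = models.googlenet(weights="DEFAULT").eval()
for p in model.parameters():
    p.requires_grad_(False)

# Capture the activations of one intermediate layer via a forward hook.
acts = {}
hook = model.inception4c.register_forward_hook(lambda m, i, o: acts.update(out=o))

img = torch.rand(1, 3, 224, 224, requires_grad=True)   # start from noise
opt = torch.optim.Adam([img], lr=0.05)

def total_variation(x):
    # Frequency penalization: penalize differences between neighboring pixels.
    return (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
           (x[..., :, 1:] - x[..., :, :-1]).abs().mean()

channel = 42                                            # hypothetical target channel
for step in range(256):
    # Transformation robustness: random jitter before each optimization step.
    dx, dy = torch.randint(-8, 9, (2,)).tolist()
    jittered = torch.roll(img, shifts=(dx, dy), dims=(2, 3))
    model(jittered)
    activation = acts["out"][0, channel].mean()
    loss = -activation + 0.1 * total_variation(img)     # maximize activation, keep image smooth
    opt.zero_grad()
    loss.backward()
    opt.step()
    img.data.clamp_(0, 1)

hook.remove()
```

Dropping the jitter and the total-variation term tends to reproduce exactly the high-frequency “duds” described above.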


Knowledge Neurons in Pretrained Transformers (May 2022)

(link)

The paper shows that you can identify particular neurons in an LLM that encode a particular fact. Those neurons indeed get activated when the LLM needs or expresses that particular fact.

  • First, suppressing and amplifying knowledge neurons notably affects the expression of the corresponding knowledge.

  • Second, we find that knowledge neurons of a fact tend to be activated more by corresponding knowledge expressing prompts.

  • Third, given the knowledge neurons of a fact, the top activating prompts retrieved from open-domain texts usually express the corresponding fact, while the bottom activating prompts do not express the correct relation.

The paper identifies knowledge neurons in the following way: (1) pick a particular fact that gets correctly predicted by the LLM, (2) go through the neurons in the LLM and gradually change each neuron’s activation value by itself from 0 up to its original value, (3) integrate how much the correct answer’s probability changes along the way. The neurons whose changes have the biggest impact on the answer are the knowledge neurons.
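Here is a minimal sketch of that attribution idea - an integrated-gradients style loop over one FFN layer’s activations - assuming a HuggingFace masked language model. The model, layer index, and prompt are illustrative stand-ins, not the paper’s exact setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

inputs = tok("The capital of France is [MASK].", return_tensors="pt")
mask_pos = (inputs.input_ids[0] == tok.mask_token_id).nonzero().item()
answer_id = tok.convert_tokens_to_ids("paris")

ffn = model.bert.encoder.layer[9].intermediate   # the "neurons" of one FFN layer (illustrative)
state = {"alpha": 1.0}

def scale_hook(module, inp, out):
    scaled = out * state["alpha"]                # scale every neuron in this layer by alpha
    scaled.retain_grad()
    state["acts"] = scaled
    return scaled

handle = ffn.register_forward_hook(scale_hook)

# Baseline activations at full strength (alpha = 1).
_ = model(**inputs)
baseline = state["acts"][0, mask_pos].detach()

# Integrate the gradient of the answer probability as alpha goes from 0 to 1.
steps, grads = 20, 0.0
for i in range(1, steps + 1):
    state["alpha"] = i / steps
    logits = model(**inputs).logits
    prob = logits[0, mask_pos].softmax(-1)[answer_id]
    model.zero_grad()
    prob.backward()
    grads = grads + state["acts"].grad[0, mask_pos]

attribution = baseline * grads / steps           # per-neuron attribution score
candidate_knowledge_neurons = attribution.topk(5).indices
handle.remove()
```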


Finding Skill Neurons in Pre-trained Transformer-based Language Models (Nov 2022)

(link)

For binary and multi-class classification tasks using a transformer, it turns out that very specific neurons can be identified which are very important for the classification task.

  • Simple algorithm: take sentiment analysis (either positive or negative for a particular input text). Look at all neurons and derive their baseline activation across many text inputs. Then look at only those inputs of a particular class (positive or negative). The neurons that activate clearly more for one class than for the other are skill neurons.

  • Here is how that looks for one selected neuron. This shows for around 3,000 sentences how much the neuron activates for positive vs. negative sentiment sentences. Clearly the neuron activates hugely above baseline for positive sentences.
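Here is a toy sketch of that selection rule, assuming you have already extracted one layer’s activations for a batch of labeled sentences. The arrays below are random stand-ins, and the paper’s actual procedure is more involved (it works on soft-prompt tokens from prompt tuning and uses a validation split).

```python
# Toy sketch of skill-neuron selection: activations is (n_examples, n_neurons),
# labels is 0/1 sentiment.
import numpy as np

def find_skill_neurons(activations: np.ndarray, labels: np.ndarray, top_k: int = 10):
    baseline = activations.mean(axis=0)            # per-neuron baseline activation
    above = activations > baseline                 # does each neuron fire above its baseline?
    # Predictivity: how well "fires above baseline" predicts the class
    # (taking the better of the two label assignments).
    acc = (above == labels[:, None].astype(bool)).mean(axis=0)
    predictivity = np.maximum(acc, 1.0 - acc)
    return np.argsort(-predictivity)[:top_k], predictivity

# Usage with random stand-in data:
rng = np.random.default_rng(0)
acts = rng.normal(size=(3000, 768))
labels = rng.integers(0, 2, size=3000)
neurons, scores = find_skill_neurons(acts, labels)
```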

Once you identify skill neurons, it turns out that:

  • Skill neurons generally and stably emerge. For all the 7 investigated tasks and 5 random trials, we can consistently find skill neurons with high predictivities close to prompt tuning.

  • Skill neurons are crucial for handling tasks. When we perturb skill neurons by adding random noises to their activations, the performances on corresponding tasks drop much more significantly than when random neurons are perturbed.

  • Skill neurons are task-specific. Similar tasks exhibit similar predictivity rankings of skill neurons, and skill neurons of same-type tasks are more important for handling a task than those of different-type tasks.

  • Skill neurons are not from shallow word selectivity. The skill neurons typically do not selectively activate on keywords relating to the task, and their predictivities are not significantly influenced by the label words used in prompt tuning.

  • Where do skill neurons come from, i.e., do skill neurons acquire these skills in pre-training or prompt tuning? We find empirical evidence that skill neurons are most likely generated in pre-training.

    • This is really crazy: what this means is that the pre-training, without any specification of the task, already assigned certain neurons in the LLM the role of figuring out positive vs. negative sentiment - and the prompt tuning then simply identifies those skill neurons, effectively. The skill was in the LLM all along, prompt tuning just “finds” it!

    • Here is a cool table: for various tasks, it first finds skill neurons for each task, and then it shows the probability of getting the classification right with a mere random guess (that’s the baseline). Then it tests a completely randomly initialized model and finds that it actually has skill neurons that happen to be somewhat good at that particular task. “Good” here isn’t all that great - but it’s still better than random guessing. Finally, skill neurons in pre-trained models end up being really great at almost all classification tasks.


Eliciting Latent Predictions from Transformers with the Tuned Lens (Mar 2023)

(link)

We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer.

  • To do so, we train an affine probe for each block in a frozen pretrained model, making it possible to decode every hidden state into a distribution over the vocabulary.

  • They view each layer in a transformer as performing an incremental update to a latent prediction of the next token - so at each intermediate layer, they convert the hidden state into a distribution over the vocabulary. This “prediction trajectory” converges smoothly towards the final output. Below is the visualization of that.

  • Simply using the model’s unembedding matrix on internal states does not work well: the outputs are hard to interpret because of “representational drift” - meaning, features might be represented differently at different layers of the network.

  • So the simple idea here is: train a “translator” for each layer of the network so that its image under the unembedding matches the final layer logits as closely as possible.

Also clever: the features most influential on the tuned lens output are also influential on the model itself.

  • We introduce a novel algorithm called causal basis extraction (CBE) and use it to locate the directions in the residual stream with the highest influence on the tuned lens. We then ablate these directions in the corresponding model hidden states, and find that these features tend to be disproportionately influential on the model output.

How does the algorithm work?

  • Take a transformer with L layers. At an arbitrary layer l, the transformer’s output is the formula below - which takes the state hl coming out of the transformer layers up to that point l, plus all the updates that occur through the transformer’s remaining layers. The simplest idea is to just set those remaining updates to 0 - so you just run the intermediate hidden state through the final LayerNorm and unembedding, and that’s the output.

  • That’s a good idea, but it turns out it doesn’t work that well. Instead, the paper proposes a simple learnable matrix that maps from the output space of layer l to the input space of the final transformer layer. That matrix (one for each intermediate layer) is trained to minimize the distance between the “tuned lens” logits and the final-layer logits (a minimal training sketch appears after this list).

  • This does a nice job at predicting the final output: the “perplexity” of a model measures how well it predicts a sample (lower is better). Below is the perplexity of the “tuned lens” by layer - meaning, if you take the output from layer l (on the x-axis), multiply it by the learned matrix A(l), and run LayerNorm on it, how good of a prediction is that vector for the transformer’s final output? Not surprisingly, the later the layer you use, the closer you get to the output (lower perplexity); and the larger the model, the more layers there are.

  • Which tests can we run as to whether the “tuned lens” extracts useful information?

    • Idea: if we change a hidden state so as to boost a certain output, then that same change should also boost the transformer’s final output in the same way. For example, if we boost the word “dog” that shows up at an intermediate stage, then the word should also be boosted in the final output.

    • And another idea: to figure out the most important influence on any given hidden (latent) vector, we can figure out which erasure from the vector maximally degrades the model’s accuracy. The paper proposes “causal basis extraction”: instead of just figuring out one “direction” whose erasure from the vector degrades the model’s accuracy, we can then also find orthonormal basis vectors in other “directions” that are the next best in degrading the prediction accuracy.

    • Clever idea: the model’s “prediction depth” - if a prompt for a model is easy, then it should converge quickly to the correct answer, which means that even early layers should already be selecting the right answer. The “prediction depth” thus is the hidden layer number after which the highest-probability answer doesn’t change anymore. It turns out that for various datasets, the “prediction depth” correlates well with the “iteration learned” metric, which measures how quickly the model converged in training to producing the right answer. Meaning, two very different measures that intuitively measure “ease of prediction” correlate well.

    • Another interesting chart: this shows the prediction depth by token, i.e., if the model’s hidden layers converged on the token very quickly and early (low depth), then the color is red. For example, the word “including” after “exhibits human-level performance on various professional and academic benchmarks,” has high prediction depth - meaning, it was hard for the model to pick that particular word. Which makes sense, given that the preceding sentence is essentially complete at that point, and the continuation could start in a lot of different ways.
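Here is a minimal sketch of training one such translator for a single GPT-2 layer, assuming HuggingFace transformers; the layer index, training text, and hyperparameters are illustrative, and the real implementation trains one translator per layer on large amounts of text.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)

d = model.config.n_embd
layer_idx = 6                                   # which intermediate layer to probe (illustrative)
translator = torch.nn.Linear(d, d)              # the learnable affine map for this layer
opt = torch.optim.Adam(translator.parameters(), lr=1e-3)

def unembed(h):
    # Apply the model's own final LayerNorm and unembedding to a hidden state.
    return model.lm_head(model.transformer.ln_f(h))

texts = ["The quick brown fox jumps over the lazy dog."]   # stand-in training data
for epoch in range(100):
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        out = model(ids, output_hidden_states=True)
        h_mid = out.hidden_states[layer_idx]                # hidden state after layer_idx
        final_logits = out.logits.detach()
        lens_logits = unembed(translator(h_mid))
        # Train the translator so the lens distribution matches the final distribution.
        loss = F.kl_div(lens_logits.log_softmax(-1),
                        final_logits.log_softmax(-1),
                        log_target=True, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
```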


Crawling The Internal Knowledge-Base of Language Models (Jan 2023)

(link)

How much and what kind of factual knowledge does an LLM store? Idea: convert the LLM into a knowledge graph. To keep it simple, construct a KG around a given seed entity (e.g., “Alan Turing”). This is an example of what the algorithm in this paper constructs:

This is the algorithm:

  • All the knowledge-crawling sub-tasks are handled as few-shot in-context learning - i.e., they are given as prompts to the LLM without fine-tuning.

  • First, given an entity e, we generate the relations relevant for e (e.g., EDUCATED AT, PLACE OF BIRTH).

    • Few-shot prompt: use Wikidata to extract relations for several example entities (e.g., Rene Magritte = place of birth, gender, spouse), feed those to the LLM as examples, and ask it to do the same for our seed entity.

  • Second, for each entity e and relation r, we generate the corresponding set of objects O and add to the KG triplets (e, r, o) for each o ∈ O. For example, for ALAN TURING and EDUCATED AT, we generate triplets with the objects KING’S COLLEGE and SHERBORNE SCHOOL.

    • Few-shot prompt: again use Wikidata to generate a single object for each “entity-relation” pair, then feed a few of those into a prompt (e.g., “Q: Monte Cremasco # country A: Italy”)

  • To maintain high precision, we prompt the model to emit “Don’t know” whenever it is not confident about the target objects. All the above outputs are generated through in-context learning, where we use the WIKIDATA KG to construct in-context examples. “Don’t know” examples are constructed by finding true facts in WIKIDATA that are unknown to the LM.

    • This is clever: here, we find specific facts that the LLM gets wrong (e.g., it answers “Q: Bill Clinton # children” with “Marilyn Monroe”), and then add those as “don’t know” examples to the prompt, so that the LLM learns to output “don’t know” in such cases.

  • Finally, we increase recall by prompting the LM to generate paraphrases for entities and relations, and use those to obtain additional triplets. (For example, the entity WILLIAM CLINTON can be referred to as WILLIAM JEFFERSON CLINTON or BILL CLINTON, and the relation OCCUPATION may be expressed as PROFESSION. Thus, we run object and relation generation for all these variants, and pool the results to construct the final graph.)
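To make the prompt format concrete, here is a sketch of how the relation-generation and object-generation prompts might be assembled, assuming a generic complete(prompt) -> str LLM call. The in-context examples are illustrative; the paper builds them automatically from Wikidata.

```python
# Hypothetical helpers for building the few-shot prompts described above.
def relation_prompt(seed_entity: str) -> str:
    examples = [
        ("Rene Magritte", ["place of birth", "gender", "spouse"]),
        ("Monte Cremasco", ["country", "population"]),
    ]
    lines = [f"Q: {e} # relations A: {', '.join(rels)}" for e, rels in examples]
    lines.append(f"Q: {seed_entity} # relations A:")
    return "\n".join(lines)

def object_prompt(entity: str, relation: str) -> str:
    # "Don't know" examples teach the model to abstain when it is not confident.
    examples = [
        ("Monte Cremasco", "country", "Italy"),
        ("Bill Clinton", "children", "Don't know"),
    ]
    lines = [f"Q: {e} # {r} A: {o}" for e, r, o in examples]
    lines.append(f"Q: {entity} # {relation} A:")
    return "\n".join(lines)

# Usage (with some LLM completion function `complete`):
#   relations = complete(relation_prompt("Alan Turing")).split(", ")
#   for r in relations:
#       obj = complete(object_prompt("Alan Turing", r))
```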

This works surprisingly well: the main test set has 100 entities, the head test set 20 more popular ones. “Pure-Greedy” does not work very well: only 55% of the facts are correct. LMCRAWL adds two things: 1) always sampling 3 results from the LLM and picking the best one, and 2) “don’t know” prompting. Good lesson here: “don’t know” is important, and it can be trained in-context. (One- vs. two-hop means expanding entities recursively once you have the first entities surrounding your seed entity.)

This is a great “ablation study” that looks at what exactly makes how much of a difference: adding “don’t know” increases precision by 5-7 points; adding “relation paraphrasing” or “subject paraphrasing” alone only increases the number of facts generated - but adding both of them actually increases both precision and the number of facts (in the final row).


Do large language models encode a world model? (Jan 2023)

This is a great article that looks at whether large language models build a world model, meaning, an actual representation of the world they are supposed to predict and understand. It’s not obvious that they do, because they could just be incredibly good at memorizing lots of sequences, and thus always quite simply predict the next one.

  • This paper does something simple: it trains a GPT only on the play sequences in the Othello board game. (Othello has an 8x8 grid, one player is black and one is white, each one places a stone on the board, and all stones between two stones of the same color turn into that color.)

  • So the transformer (Othello-GPT) simply gets the entire sequence of moves by both players as input (e.g., “D4 E5 E4 F6”, all the places where players placed their stones up until that point). It then predicts where the next player will place the next stone. Simple!

  • The question is now: has this transformer learned any kind of world model? When you feed in an input sequence and look at the activation of its neurons, can you tell whether it “stores” the current picture of the board, for example? (Which would be a world model: the network “knows” or stores the current state of the board, its world.)

  • The article uses “probes” to figure this out. A probe is the following idea: train a really simple classifier that takes as its input the network’s internal activations, and which predicts something that we know is part of a “world model”. If a really simple classifier is able to do that, it means that the neural network must be really good at encoding this world model - because even a really simple classifier can easily “decode” it.

The paper trains 64 probes: each probe p(i,j) is a simple two-layer network that takes the transformer’s internal activations and is supposed to say whether grid position (i, j) is currently black, white or blank. Let’s say the input sequence is “D5 E5 F5 G5” (white places on D5, black on E5, white on F5 so E5 turns white, then black on G5). Then the probe p(E5) should return white.
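Here is a toy sketch of such a probe, assuming you have already collected pairs of (hidden state, square label) from Othello-GPT; dimensions and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class SquareProbe(nn.Module):
    """Two-layer MLP predicting one square's state: 0 = blank, 1 = black, 2 = white."""
    def __init__(self, d_model: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, h):
        return self.net(h)

def train_probe(hidden_states, labels, d_model=512, epochs=50):
    # hidden_states: (n_positions, d_model) activations from some layer of Othello-GPT
    # labels: (n_positions,) ground-truth state of one grid square
    probe = SquareProbe(d_model)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        loss = loss_fn(probe(hidden_states), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe

# One probe per square: probes[(i, j)] = train_probe(hidden_states, labels_for_square_ij)
```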

  • That’s exactly what happens: when the transformer isn’t trained, the probes have an error rate of 26%. When the transformer is fully trained, the probes’ error rate goes down to 1.7%.

  • This clearly suggests that the transformer must be getting good at encoding a real world model. Otherwise, the simple probe classifiers simply shouldn’t be able to “extract” from the transformer what the current state is.

  • Now they do something clever: because they know the coefficients of the probe, they can reverse-engineer which activations the probe for, say, D5 would have to see in the transformer to conclude that this grid position is, say, white. They can then edit the activations accordingly to “write” white into position D5. They can even use this to create an illegal state of the game (a position that could never have been reached by any legal sequence of moves).

  • All of this just means: yes, the LLM (the transformer) really does seem to encode a world model.


SolidGoldMagikarp (Feb 2023)

(link)

Simple idea: a neural network gets an input and produces an output. We pick a particular output, and then calculate the input that maximizes the network’s confidence that this is exactly that output. Say you have a neural network image classifier, and you want to see the image that maximizes the network’s output for “goldfish”: we start with a random image, and we use gradient descent on the image to calculate another image that maximizes the network’s output for “goldfish”. (Turns out it’s a really weird-looking image.)

When you do that for GPT-2, you get the following maximization by output token:

The paper goes on to discover weird tokens that the LLM doesn’t know what to do with. They tend to be close to the origin in the embedding space, but it is unclear why. It’s possible that these are tokens that appeared in the material used to train the tokenizer, but then barely made it into the LLM’s own training data.


Interpretability at Scale: Identifying Causal Mechanisms in Alpaca (May 2023, added 5/21/23)

(link)

Great paper that looks at how LLMs construct internal algorithms to solve a particular task. Other papers have looked at where (in which neurons) LLMs store factual knowledge, or at how you can bias their output towards a particular concept by steering activations. But this paper looks at what happens if we ask the LLM to implement a simple algorithm, like “say yes if $5.00 is more expensive than $3.00 but cheaper than $9.00”: can we detect where in the model this algorithm gets “implemented”? The paper proposes a clever algorithm that lets you test whether the algorithm the LLM uses internally is the one you expect, or a different one.

The paper uses Alpaca 7B (the instruction-tuned chat version of Facebook’s open-source Llama) and focuses on this prompt: “Please say yes only if it costs between [X.XX] and [X.XX] dollars, otherwise no.” The output is a single token ‘Yes’ or ‘No’. Here is the trick: there are many different ways a computer could solve this task. Let’s pick four of them, see below. “Left boundary” first creates a boolean variable that checks whether the input price is higher than the left boundary, then combines that boolean variable with the check of whether the price is also lower than the right boundary. “Left and right boundary” creates two different boolean variables, one that checks the lower and one that checks the upper boundary, then combines them. “Mid-point distance” calculates the midpoint between the boundaries, then checks whether the input price’s absolute distance to that midpoint is small enough. “Bracket identity” creates an interval variable and passes that to the output node for evaluation.

So this is how a real computer program would be represented: as flows of information between nodes, where each node creates a temporary variable that gets passed along.

  • Now the paper proposes an algorithm to check whether this is how the model actually does it, called Boundless Distributed Alignment Search. You need to give that algorithm a) an LLM plus b) a causal algorithm model (one of the 4 pictures above) plus c) an input token string.

  • The algorithm then calculates a number called Interchange Intervention Accuracy (IIA) between 0 and 1, which gets calculated for each layer in the LLM and each token in the input string. If you get a number close to 1.0, it means that the LLM is very likely to actually follow your algorithm when carrying out the task (prompt) you gave it.

  • This is how it looks for our 4 algorithms from above. At the bottom, you see the final tokens from the input query flowing through: “ X.XX dollars. Response: “ At the right, you can see that we’re calculating the IIA only for certain layers. But you can nicely see the model seems to be using the “Left Boundary” algorithm, because that’s where the IIA numbers are highest. You can also see that it probably forms that initial left boundary boolean variable (P is true if X <= Z) as early as layers 5-10, and then it seems to pass that variable along to higher layers. In layer 15, it might be doing the right boundary calculation, and then it turns that into a “yes/no” in the final layer and when it sees the final token. Neat! Clearly “Mid-point Distance” and “Bracket Identity” receive much lower scores, meaning the Boundless Distributed Alignment Search algorithm isn’t able to “detect” in the LLM that it’s implementing those algorithms anywhere.
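The core operation behind this is the interchange intervention: run the model on a “base” prompt and on a “source” prompt, swap the hidden state at a candidate (layer, token position) from the source run into the base run, and check whether the output changes the way the hypothesized causal model predicts. Here is a minimal sketch using GPT-2 as a stand-in; the paper works on Alpaca 7B and additionally learns a rotation of the hidden space (that is the “distributed” and “boundless” part), which is not shown here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

base = tok("Please say yes only if it costs between 2.00 and 9.00 dollars. 5.00 dollars.",
           return_tensors="pt")
source = tok("Please say yes only if it costs between 6.00 and 9.00 dollars. 5.00 dollars.",
             return_tensors="pt")

layer, pos = 10, -1                  # candidate location of the causal variable (illustrative)
cache = {}

def save_hook(module, inp, out):
    cache["h"] = out[0][:, pos, :].detach()        # remember the source run's hidden state

def patch_hook(module, inp, out):
    out[0][:, pos, :] = cache["h"]                 # overwrite the base run's hidden state
    return out

h = model.transformer.h[layer].register_forward_hook(save_hook)
with torch.no_grad():
    model(**source)
h.remove()

h = model.transformer.h[layer].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**base).logits[0, -1]
h.remove()
# If the patched run behaves as if the left boundary were 6.00 (i.e., the model now leans
# towards "No"), the variable "price is above the left boundary" plausibly lives at (layer, pos).
```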

Another good insight: does any of this change when you change the input prompt?

  • It doesn’t change at all when you change the input prices (like “price between $3 and $5” to “price between $1 and $9”) - that’s a good sign that we’re finding something “real” and that the model always solves this problem in the same way.

  • It doesn’t change much either if you insert irrelevant context: like, if you prepend a crazy unrelated sentence to the input prompt, then the LLM’s performance goes down somewhat on this task, but it still does the same things in the same places.


Evidence of Meaning in Language Models Trained on Programs (May 2023, added 5/22/23)

(link)

This is a great paper. One of the most basic questions about LLMs is: do they learn meaning, or just syntax? They have a simple training objective: minimize loss when predicting the next token on lots of text. Is it possible to learn meaning from form alone? This paper has a clever way of showing that yes, LLMs indeed start to form semantics of their world that go beyond simple syntax repetition. This is a similar insight as the Othello paper.

This is how they test this: they focus on a simple task of giving a transformer a map of a small world with a robot in it (an input map), and another map of the same world (an output map), and the LLM has to write a program that gives the robot instructions to go from the input map to the output map. Here is a simple illustration: when the “reference program” runs on the map(s) called “inputs”, then the robot gets moved such that the map(s) called “output” emerge. (The robot can also put down and pick up “markers”.)

Here is how the training works:

  • They use an off-the-shelf transformer that is only going to be trained on this task (no language training etc.).

  • They create synthetic training data in the following way:

    • Write a random program (called “reference program” in the illustration above)

    • For each program, create 5 random “input” maps that all look different

    • Run the reference program on each of the 5 input maps, which creates 5 output maps

    • Then serialize the 5 input maps, the 5 output maps into text strings (just read them left to right, top to bottom), add the reference program, and you have a nice text corpus for training. Do that 1 million times.

  • Then they simply train on minimizing loss on next token prediction on these input maps-program-output maps tuples. Note again: only training on next token prediction, there is no other intelligence here in the training! The LLM is never taught any kind of semantics outright.

  • What will this transformer be capable of doing at training end? Simple: you give it 5 input and 5 output maps, and it will write the program for you that moves the robot just so that each of the 5 input maps becomes the corresponding output map.

  • Also, during training, they store a “trace” dataset as follows: every 2,000 training steps, they take the current transformer and record its internal states (activations), so they have the “evolution” of the transformer stored across training.

    • Importantly, they do that while they’re running tokens through the LLM: and remember, the transformer is writing robot movement programs for us - so after it generated the tokens that constitute, say, half of a program, it means that the robot will have moved halfway through an input map.

Ok, now how do we know that the LLM is learning more than just being a stochastic parrot? In a number of tests:

  • First, they see if the LLM state at any given point in time somehow encodes the direction the robot is facing in. Remember, the transformer is writing robot movement programs. So halfway through a program generation, we can just look at the tokens of the program the transformer has produced so far, and we know precisely which direction the robot must be facing in (just from checking out the half-written program). Now, can we somehow see that in the transformer activations? (The activations here come from feeding that half-program back into the transformer, because that’s how we’d get the next token in the program - it’s a next-token-prediction machine!)

    • The answer is yes: they train a simple linear probe that takes the “transformer trace” (the list of its internal activations) and correlates it with the ground-truth robot direction (which we know from executing the partial program). That probe turns out to be very good at reading out from the transformer state which direction the robot is facing (a toy version of such a probe appears after this list). In other words, if I feed the tokens of half a program into the transformer, (a) it will predict the continuation of the program for me, and (b) its inner state will somehow reflect the current direction the robot is facing. The paper calls this the “semantic content” of the LLM.

    • Not only that: the chart below shows this semantic content (the probability that the linear probe extracts the correct robot direction from the LLM) vs. the LLM’s generative accuracy - meaning, how good the programs are that it creates in connecting input to output maps correctly. Turns out they look just about the same! The better the LLM gets at creating correct programs, the better the internal state of the LLM somehow magically reflects the robot’s facing direction.

  • Ok, but that could just mean that the LLM just represents the meaning of the text it already generated. Does it have a plan? Does it know where the robot needs to go next? To test this, they train another linear probe, and this time they try to use the LLM’s current state (after having generated half of the program tokens, let’s say) to predict the robot direction one or two moves in advance. If you can take the LLM’s current weights, after having generated half a program, and not only can you extract the robot’s current direction but also its direction two moves down the line, then that must mean that the LLM’s state already reflects a forward-looking plan for where to move the robot next.

  • And yes, this also works well, see below. It also tracks with generative accuracy. (Though it’s also interesting that the LLM’s pathfinding algorithm appears to be greedy: it doesn’t plan ahead too much - our ability to see two moves in advance gets worse.)

  • But still, it’s still possible that the LLM itself just deals with syntax, and those probes that we trained on top of it do all the semantic interpretation. To test that, the paper has a clever way of just renaming all the syntax and doing all this again, and it still works. Details in the paper - but this suggests that it really is the LLM that has encoded this “semantic state” in itself, not just the extra probe we trained.

  • Finally, two more pieces of evidence that there is “meaning” happening here: 1) the average length of the programs that the LLM writes is lower than the program length in the training set, meaning the LLM finds more efficient ways to write robot directions; and 2) the LLM’s perplexity on the training data never really converges, i.e., the LLM never stops being “perplexed” by seeing its own training data again, meaning it doesn’t ever get that great at memorizing the exact next token that shows up in the training data.
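For illustration, a probe like the ones described above can be as simple as a logistic regression from the transformer’s hidden states to the robot’s facing direction; the arrays below are random stand-ins for the real trace data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(5000, 512))    # stand-in for transformer states mid-generation
directions = rng.integers(0, 4, size=5000)      # stand-in for ground-truth facing direction (N/E/S/W)

probe = LogisticRegression(max_iter=1000)
probe.fit(hidden_states[:4000], directions[:4000])
semantic_content = probe.score(hidden_states[4000:], directions[4000:])
# If this accuracy is far above chance (0.25), the LLM's state encodes the robot's direction.
```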


Explaining Grokking Through Circuit Efficiency

“Grokking” refers to a very particular phenomenon in transformers, when we train a transformer from the ground up on a particular task:

  • When the training data size is too small, transformers seem to simply memorize, but fail to generalize. They become great at regurgitating training data, but when you show them new test data, they’re not particularly good.

  • When the training data size gets big beyond a certain point, transformers often very suddenly “flip” to a generalized solution. Suddenly, they seem to find the actual algorithm that “solves” the target task. They become good at unseen test data, because they now have a generalized solution for it.

This paper runs a simple simulation to show that there appear to be three ingredients necessary for grokking behavior to appear. More on the three ingredients below. But first, here is how we train transformers:

  • For a given set of inputs X and labels Y, we change the transformer weights to minimize “cross-entropy loss”, which means we want the transformer to become as good as possible at matching any X to the correct Y.

  • This also means we want the transformer to become as “opinionated” as possible: for a given X, we want it to output very high probabilities for the best-matching target labels Y. (Opinionated = strong probability that it feels it is right)

  • We also commonly use weight decay: this means that we also minimize the transformer’s “parameter norm” - we train so that the overall size (norm) of the transformer’s weights is as small as possible. The intuition here is the following: if the final transformer has lots of large weights, it is quite likely that it simply stored all of the training data in its weights - meaning, it just memorized the training data, which isn’t very useful.
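A minimal sketch of this training setup - cross-entropy loss plus weight decay as an explicit parameter-norm pressure - on a toy one-layer transformer; all hyperparameters here are illustrative, not the paper’s.

```python
import torch
import torch.nn as nn

model = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
head = nn.Linear(128, 97)                       # e.g., predict one of 97 output classes
params = list(model.parameters()) + list(head.parameters())
# weight_decay is the pressure towards a small parameter norm.
opt = torch.optim.AdamW(params, lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def train_step(x_embedded, y):
    # x_embedded: (batch, seq, 128) already-embedded inputs; y: (batch,) target labels
    logits = head(model(x_embedded)[:, -1, :])  # predict from the last position
    loss = loss_fn(logits, y)                   # cross-entropy: be as "opinionated" as possible
    opt.zero_grad()
    loss.backward()
    opt.step()                                  # AdamW also applies the weight decay
    return loss.item()
```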

If that is how we train, then what are the three ingredients we need to see “grokking”?

  1. Generalizing circuit: There are two “circuits”. A circuit is the set of weights that the transformer comes to implement during training. It’s kind of another word for “algorithm” - i.e., how this transformer behaves.

    1. Cmem which memorizes training data and thus is good at evaluating training data but is bad at test data

    2. Cgen which generalizes and is good at evaluating both training data and test data.

  2. Efficiency: Cgen is more “efficient” than Cmem. This means that the generalized solution is a) great at predicting the right solution for each input (i.e., minimizing cross-entropy loss) while b) doing that with the smallest weights possible (i.e., minimizing parameter norm).

  3. Slow vs. fast learning: Cgen is learned more slowly than Cmem. So, when we start training, it is simply easier early on for the transformer to first learn the memorized solution.

The paper now does something rather simple: it uses a 1-layer transformer, and it sets up a really simple simulation where it can precisely alter exactly these three ingredients.

  • The simulation’s Cgen circuit is simply a look-up table that produces perfect train and test accuracy. So any token x you give it, it will produce the correct output token y.

  • The simulation’s Cmem circuit is a look-up table that produces perfect train accuracy, but makes confident, incorrect predictions on the test dataset.

  • They then find a clever way to set this up such that it takes longer for the transformer to learn Cgen vs. Cmem (details don’t matter).

  • They also set it up such that the parameter norms Pgen and Pmem (i.e., the overall size of the weights) after having learned Cgen and Cmem can be controlled.

And this is exactly when you start to see grokking.

Here is how to interpret these charts:

  1. In this case, all three ingredients are present (Cgen is more efficient, and is learned more slowly). You see that after training just a few steps, the transformer implements Cmem, so it works well for the training dataset (“train loss” is low), but it doesn’t work well for the test dataset (“test loss” is initially high - the transformer doesn’t know what to do with inputs it hasn’t seen in training). But then, as you keep training, the transformer flips over to Cgen, and then the test loss suddenly drops. You can also see parameter norm drop: because Cgen is more efficient (needs smaller weights to solve the problem), and because the transformer now flips over to that, it overall has more efficient weights.

  2. In this case, the generalized solution Cgen is less efficient than the memorized solution Cmem. Because we’re training by optimizing for weight decay, this means that our training will keep searching for the lowest-weight (lowest parameter norm) solution, which will just remain Cmem from the beginning.

  3. In this case, the learning speed for both solutions is set to be the same. This just means that Cgen gets learned right away, and no delayed grokking transition occurs - the transformer generalizes from the start.

That’s it! This just confirms the intuition that you indeed need these three ingredients to see grokking. One final observation: sometimes when you keep training, the transformer undergoes “ungrokking”: it forgets the generalized solution it had already found.

  • This can now be explained simply: it must be because we have a new training dataset, and in that new training setting, Cmem is now more efficient than Cgen.

  • And so we predict that, with enough further training, gradient descent will reallocate weight from Cgen to Cmem, leading to a transition from high test performance to low test performance.