00: Some General Insights

Insights

  • Notation matters to transformers to a surprisingly large degree: simply how you write something down makes it much easier or harder for a transformer to understand the actual meaning behind it. Three examples: (1) In the paper “Language Models can be Logical Solvers”, an LLM gets fine-tuned on logical solver data, and “Green(’Charlie’, True)” is way easier for transformers to understand than “Charlie is green”. (2) In the paper “Transformers Can Achieve Length Generalization But Not Robustly”, transformers get much better at learning multi-digit addition when we reverse the numbers and insert index letters that let them properly distinguish digit positions. (3) The paper “Large Language Models Can Learn Rules” finds that if we give an LLM lots of rules to follow in its prompt, we make it much easier for the model to look up the correct rule for a particular task if we “tag” the rules logically. (A toy sketch of these formatting tricks follows after this list.)

  • In few-shot learning, does giving more examples always increase performance? No: the paper “EHRAgent: Code Empowers Large Language Models for Complex Tabular Reasoning on Electronic Health Records”, which describes an agent for querying electronic medical record data, shows that performance levels off once you give 4 or more few-shot examples. Interestingly, for AutoGen, another agent-based approach used in the EHR setting, performance even goes down once you go beyond 5-6 examples.

  • LLMs are surprisingly bad at self-consistency and self-modeling, as shown in the paper “Can Large Language Models Explain Themselves?”: ask them to classify a paragraph (for, say, sentiment), then ask them to redact the key words that would change that classification, and then to re-classify the redacted paragraph - and it turns out they aren’t really able to figure out what to redact. Which suggests they don’t really know why they classify the way they classify.

  • What “work” do LLMs do in their hidden layers? The paper “In-Context Learning Creates Task Vectors” shows that, very roughly speaking, the early layers build a “solution function” (a set of weights that, when applied to the query token, maps it to the correct solution), and the later layers then apply that function to produce the answer. Surprisingly, no matter the model size, the formation of the correct task vector happens around the same layer. But: the number of layers (network depth) clearly seems to matter - quite likely, that is the reason why chain-of-thought works so well: it effectively “chains” several forward passes of the LLM, and thus increases effective network depth. But somewhat surprisingly, when playing chess, increasing the number of layers (while keeping overall parameter count the same) does not make a difference beyond a certain number of layers, as the paper “Grandmaster-Level Chess Without Search” shows.

  • In “Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation”, we see an example of error propagation in weaker language models: an LLM is able to iteratively improve a code optimizer - but GPT-3.5 just makes it worse with each iteration.

  • Self-correction doesn’t work: The longer you let an LLM work on its own answer, the more likely it is to make it less correct. The paper “Large Language Models Cannot Self-Correct Reasoning Yet” shows that decisively. This might also be another example of error propagation.

    • The paper “Can large language models identify and correct their mistakes?” makes the same point, but adds that LLMs could use backtracking to find their errors.

  • Language models come to build a “world model”: The paper “Language Models Represent Space And Time” shows that you can simply take a layer’s activations, run them through a linear regression, and “see” latitude/longitude coordinates when the model is answering a question about a particular place. This adds to the evidence from the Othello paper and others that LLMs build these world models.
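
Below is a minimal sketch of what such a linear “world model” probe can look like, assuming you have already cached hidden states for a set of place-related prompts together with their true coordinates (the file names and probed layer are placeholders, not the paper’s actual setup):

```python
# Hypothetical setup: hidden states of one layer for n place-related prompts,
# plus the true latitude/longitude of each place. We fit a linear probe and
# check how much geographic information the activations linearly encode.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

hidden_states = np.load("layer_20_activations.npy")  # shape (n, d_model), placeholder file
lat_lon = np.load("lat_lon_labels.npy")              # shape (n, 2), placeholder file

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, lat_lon, test_size=0.2, random_state=0
)
probe = Ridge(alpha=1.0).fit(X_train, y_train)
print("R^2 of the linear lat/lon probe:", probe.score(X_test, y_test))
```

A high held-out R^2 means the coordinates are (approximately) linearly readable from the activations - which is the kind of evidence the paper reports.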
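
And, going back to the first bullet in this list, here is a toy sketch of the three notation tricks (the exact formats are illustrative; the papers’ own encodings differ in detail):

```python
# Toy re-formatters for the three notation tricks described above.
def to_predicate(subject: str, attribute: str) -> str:
    # "Charlie is green" -> "Green('Charlie', True)"
    return f"{attribute.capitalize()}('{subject}', True)"

def format_addition(a: int, b: int) -> str:
    # Reverse the digits and attach position indices so the model can line up
    # place values, e.g. 123 -> "3_1 2_2 1_3" (the index scheme is illustrative).
    def rev(n: int) -> str:
        return " ".join(f"{d}_{i + 1}" for i, d in enumerate(str(n)[::-1]))
    return f"{rev(a)} + {rev(b)} ="

def tag_rule(idx: int, rule: str) -> str:
    # "Tag" each rule with an identifier so the model can retrieve it by name.
    return f"[RULE_{idx}] {rule}"

print(to_predicate("Charlie", "green"))   # Green('Charlie', True)
print(format_addition(123, 45))           # 3_1 2_2 1_3 + 5_1 4_2 =
print(tag_rule(7, "If X is green, then X is a frog."))
```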


In theory, LLMs can encode algorithms of high complexity (added 6/25/23)

Even though the practical experience with any LLM is that it quickly makes mistakes even on straightforward integer additions, there actually isn’t anything in the design of a transformer that would in theory keep it from doing that flawlessly. In fact, it has been shown that anything a (log-precision) transformer computes can be expressed as a sentence in first-order logic with majority quantifiers - a logic that also captures a lot of complex algorithms, for example any division of n-digit integers (A Logic for Expressing Log-Precision Transformers, May 2023). That bodes well in theory for the problem-solving capabilities of the transformer architecture.
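
To give a flavor of what such sentences look like (the notation below is an illustrative simplification, not the paper’s exact formalism), first-order logic with majority quantifiers lets you state properties of the input positions like:

```latex
% "token a appears at more than half of all input positions"
\mathsf{M}\, i .\; Q_a(i)

% "every a is eventually followed by a b"
\forall i .\; \bigl( Q_a(i) \rightarrow \exists j .\; ( j > i \wedge Q_b(j) ) \bigr)
```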


In practice, out-of-the-box LLMs have significant limitations in complex reasoning algorithms (added 6/24/23)

However, in practice, LLMs appear to come out of the box with very limited reasoning capabilities. For example, LLMs are not particularly good at extracting causation from correlation (Can Large Language Models Infer Causation from Correlation?, Jun 2023). When given a synthetic ontology of concepts and asked to reason over them (because A, we have B), their performance falls to chance once the chain of logic grows longer than 5 hops (Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-thought, Jun 2023). They seem to be able to perform more complex algorithms mostly because they’ve memorized decomposed sub-steps of the algorithm in pre-training (Faith and Fate: Limits of Transformers on Compositionality, Jun 2023): the “longer” the problem, the more small errors accumulate and propagate through; and the “broader” the problem (in terms of parallel steps required to solve it), the sooner the LLM runs into its limits.


LLMs can learn specific algorithms, but they need to be taught (added 6/24/23)

Given that in theory transformers can encode complex algorithms, how does that show up in practice? For example, if you manually set the weights, you can indeed get a transformer to implement integer addition in just 5 layers (Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective, Jun 2023). And if you ask an LLM to solve a particular problem, you can actually trace through which kind of algorithm it “chooses” to solve that problem (Interpretability at Scale: Identifying Causal Mechanisms in Alpaca, May 2023), and where the LLM’s decisions “manifest” themselves. As it turns out, LLMs can learn how to become better at logistic regression (Tart: A plug-and-play Transformer module for task-agnostic reasoning, Jun 2023), and they can be fine-tuned to infer causation better (Can Large Language Models Infer Causation from Correlation?, Jun 2023).


Basic “concept math” appears to be a good representation of how LLMs and other networks understand the world (added 6/23/23)

Both in LLMs (Language Models Implement Simple Word2Vec-style Vector Arithmetic, May 2023) and in diffusion models for image generation (The Hidden Language of Diffusion Models, May 2023), we can directly identify how “concepts” propagate through the network. Those concepts adhere to the kind of concept arithmetic that we know from embeddings: queen = king - male + female. We can watch those concepts wander through the network layers, and it turns out that even the “answer” functions that the LLM forms (to answer questions like “what is the capital of France”) are a sort of concept arithmetic.
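
As a reminder of what this kind of concept arithmetic looks like in the embedding world, here is a minimal sketch using gensim (the vector file is a placeholder; any word2vec-format embedding file works):

```python
# Classic embedding arithmetic: queen ≈ king - male + female.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(kv.most_similar(positive=["king", "female"], negative=["male"], topn=3))
# With typical word2vec embeddings, "queen" should rank at or near the top
# (the classic demonstrations usually phrase it with "man"/"woman").
```

The papers above find that something analogous happens inside the transformer’s hidden states, not just in its input embeddings.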


For transformers, making sense of concepts, and reasoning on those concepts, appear to be two different things (added 6/23/23)

Transformers form “world models”, even if all of their training is just completing sentences. They extract and identify concepts from these “worlds”, and we can identify nodes that embody individual concepts (Knowledge Neurons in Pretrained Transformers, May 2022). Moreover, the two different types of layers that transformers have (feed-forward/MLP layers and attention layers) seem to play distinct roles in “managing knowledge”: a) the feed-forward layers store associative information, and b) the attention layers extract the right kind of relationships (Dissecting Recall of Factual Associations in Auto-Regressive Language Models, Apr 2023). Importantly, however, reasoning on those concepts seems to be yet another task that has to be executed by these transformers - and that is where we run into the limits discussed in some of the other observations here. In short (and probably wrong but appealingly simple right now): transformer = concept understanding * reasoning on those concepts.


An LLM watching another LLM is a good design primitive (added 6/4/23)

It is surprisingly effective to have an LLM generate output, and then have another LLM (or just another prompt to the same LLM) critique and refine that output. For example, a paper (VOYAGER: An Open-Ended Embodied Agent with Large Language Models, May 2023) that trains a Minecraft agent to explore the game has a self-verification loop in which GPT-4 looks at the plan and code that it created earlier, and that makes a big difference for ultimate performance. However, there are limits: a paper (Demystifying GPT Self-Repair for Code Generation, Jun 2023) that analyzes whether errors thrown by code generated by one LLM can be successfully fed back to another LLM to improve the code shows that most LLMs don’t improve code quality very much in that error-correction stage.
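
A minimal sketch of this design primitive, with a placeholder chat() function standing in for whatever LLM client you use (the prompts are illustrative, not taken from either paper):

```python
# Generate -> critique -> refine loop with a single LLM playing both roles.
def chat(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def generate_with_critic(task: str, rounds: int = 2) -> str:
    draft = chat(f"Solve this task:\n{task}")
    for _ in range(rounds):
        critique = chat(
            f"Task:\n{task}\n\nCandidate solution:\n{draft}\n\n"
            "List concrete errors or gaps. If the solution looks correct, answer only 'OK'."
        )
        if critique.strip() == "OK":
            break
        draft = chat(
            f"Task:\n{task}\n\nCandidate solution:\n{draft}\n\n"
            f"Critique:\n{critique}\n\nRewrite the solution, addressing every point in the critique."
        )
    return draft
```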


Transformers truly learn meaning from form - they aren’t just stochastic parrots (added 5/22/23)

When we train a transformer on the board game Othello (Do large language models encode a world model?, Jan 2023), or train a transformer on moving a robot through a map (Evidence of Meaning in Language Models Trained on Programs, Jun 2023), we see that the internal state of the LLM starts to encode an actual representation of the world it operates in, despite the model simply being trained on next-token prediction. So LLMs can learn meaning (= the state of the Othello board, the direction the robot is facing in, the next steps the robot will take) from form (= predicting the next token in a text string). This goes well beyond the pure statistical parroting that LLMs are sometimes dismissed as: clearly, something is happening in their internal state (weights and activations) that exceeds surface statistics.


Training an LLM on code and language is surprisingly synergistic (added 4/30/23)

The ability to reason (to chain together steps of formal logic) emerges in LLMs when a large enough percentage of their training base is code (e.g., from Github, see Reasoning with Language Model Prompting: A Survey, Dec 2022). The hypothesis is that code is the clearest form of logic that is available in language (and a lot of code comes with comments). So you need code in the training data!

However: Sebastien Bubeck at Microsoft trained an LLM on data of solved linear equations. The LLM was able to replicate the equation solving at the same dimensionality as its training data. But when non-mathematical training data (like Wikipedia text) was mixed into the training, it was suddenly able to generalize to higher-dimensional equations. So you need language in the training data!

Another nice data point on this is a May 2023 paper (LIMA: Less Is More for Alignment), which uses a dataset of just 1,000 well-curated instructions to fine-tune a 65B LLM to great effect: it uses a combination of StackExchange (coding) and Reddit (clever language) - and achieves GPT-4-like quality.


On average, we are all stupid (added 4/30/23)

Instruction-tuning is a remarkable feat: you take a model that has, say, 13B parameters (like Facebook's Llama 13B), which means it was trained on vastly more tokens than that, and then you take 50K instruction-tuning pairs (task description-input-output), and quite suddenly a lot of tasks that the model previously wasn't capable of "emerge". The numbers imply that this can hardly be just because those final 50K instruction-tuning pairs finally got the model to understand how, say, text summarization or algebra works. Rather, it is likely that those capabilities were encoded during the model's training - but they don't come out until the model gets instruction-tuned. Which basically means: all those developed-but-dormant capabilities might be averaging each other out, so that NONE of them comes to the forefront. So on average, we (and the LLM that is trained on our collective language output) are all stupid - until we get reminded that we're capable of a particular task and tuned on it. A paper (LIMA: Less Is More for Alignment, May 2023) coins the “Superficial Alignment Hypothesis”: a model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.


LLMs can be tricked into giving wrong answers incredibly easily, and when forced to think harder, they become even more wrong (added 5/14/23)

You can force an LLM to make mistakes in simple ways like this: (1) When asking it a multiple-choice question, say “I think the answer is A, but tell me what you think”. (2) When you show a model three multiple-choice questions where the correct answer was A each time, it is very likely to pick A again for the next question, even if it otherwise would know the answer. But even worse: if you force a model to use chain-of-thought in coming up with its answer, it will figure out (logically incoherent) ways to justify its own wrong answer - even though the very same model would have gotten the answer and the logic right, had you not given it the bias. (Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting, May 2023)
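
A small sketch of what these two biasing tricks look like as prompts (the question and wording are illustrative only):

```python
# Building the two kinds of biased prompts described above.
question = (
    "Which planet is closest to the sun?\n"
    "A) Venus  B) Mercury  C) Mars  D) Earth"
)

# Trick 1: suggest an answer yourself.
suggestion_bias = f"{question}\n\nI think the answer is A, but tell me what you think."

# Trick 2: few-shot examples whose correct answer is always A, nudging the model
# to answer A again regardless of the new question's content.
few_shot = "\n\n".join(f"Question {i}: ...\nAnswer: A" for i in range(1, 4))
pattern_bias = f"{few_shot}\n\n{question}\nAnswer:"
```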


An LLM flawlessly trained on adding 16-digit numbers can’t even generalize to 17-digit numbers (added 5/25/23)

If you fine-tune Llama with one million arithmetic examples, including addition of numbers with up to 16 digits, it will learn arithmetic surprisingly well. But when you then try addition with more digits (17 or 18), it starts making mistakes quickly. This implies that LLMs don’t really generalize the underlying mathematical operation, which would be a fundamental limitation in how they “think”. (Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks, May 2023)
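
A quick sketch of how such a length-generalization check can be run, assuming a placeholder ask_model() function for the fine-tuned model (this is not the paper's evaluation harness):

```python
# Compare accuracy at the trained digit length vs. one digit beyond it.
import random

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in the fine-tuned model here")

def addition_accuracy(num_digits: int, n: int = 100) -> float:
    correct = 0
    for _ in range(n):
        a = random.randint(10 ** (num_digits - 1), 10 ** num_digits - 1)
        b = random.randint(10 ** (num_digits - 1), 10 ** num_digits - 1)
        answer = ask_model(f"{a} + {b} = ")
        correct += answer.strip() == str(a + b)
    return correct / n

# print(addition_accuracy(16), addition_accuracy(17))  # in-distribution vs. one digit longer
```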


LLMs are fundamentally not deterministic, so don’t count on that (added 6/26/23)

The temperature setting in GPT-4 determines how the model turns its output distribution into a single token prediction: the higher the temperature, the more likely the model is to pick a token that isn’t at the top of the probability distribution it produces after processing an input. But: even with temperature = 0, the processing is not fully deterministic and can change from model run to model run. Try to get GPT-4 to tell you the color of any of the following objects: silverware, iron, elderberry, muscat wine. You will see that the assigned colors change, at temperature 0, with everything else being equal.
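
For reference, here is a minimal sketch of what temperature does mechanically (the logits are made up; the point above is that even nominal temperature = 0 in the API does not guarantee identical outputs across runs):

```python
# Logits are divided by the temperature before the softmax; as T -> 0 this
# approaches greedy argmax decoding.
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng=np.random.default_rng()) -> int:
    if temperature == 0.0:
        return int(np.argmax(logits))          # greedy decoding
    z = logits / temperature
    probs = np.exp(z - z.max())                # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.0, 0.5])
print([sample_token(logits, 0.0) for _ in range(3)])   # always index 0
print([sample_token(logits, 1.5) for _ in range(3)])   # varies from run to run
```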


In fine-tuning, every letter counts (added 5/23/23)

A paper from May 2023 tests a 65B-parameter LLM on multi-turn dialog chains, and it gets excellent responses in 45% of cases. But then it takes thirty additional multi-turn dialog examples, uses them for fine-tuning, and gets excellent responses in 76% of cases. An incredibly small number of additional examples made the model perform vastly better.


LLMs are giant superpositions of personalities (added 4/30/23)

An LLM is effectively a giant superposition of personalities. Eliciting a particular personality from it (through prompting) also makes it more likely that its opposite personality comes to the forefront - the Waluigi effect.


The order in which an LLM sees data during training doesn’t matter for memorization (added 5/6/23)

A paper (Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling, Apr 2023) tests this systematically: it turns out that it doesn’t matter if sequences are at the beginning or at the end of training - the probability that the model memorizes them remains the same.


The frequency with which an LLM sees training data seems to matter for its performance related to that data (added 5/6/23)

Train an LLM, and track how well the model performs a particular arithmetic operation. It turns out that the model gets those calculations right more often when it has seen the operands (the numbers in the calculation) more often during training. Other papers also suggest that a model gets an answer right more often if it has seen the relevant data more often during training.


On certain tasks, the typical LLM scaling (bigger is better) is reversed and bigger is worse (added 5/7/23)

On certain tasks, as the size of LLMs increases, performance begins to decrease (see, for example, Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond, Apr 2023). These are tasks where a larger model’s strengths become its weakness because they dominate its behavior - for example, larger models are better at storing lots of prior knowledge, but what if you want them to actively override that knowledge? That gets harder the larger the model is. Examples include Redefine-math (tests whether language models are able to work with common symbols when they are redefined to mean something else), Into-the-unknown (requires the model to choose which piece of information would help answer a question), and Memo-trap (asks an LM to write a phrase in a way that starts like a famous quote but ends differently). This is also called the Inverse Scaling Phenomenon. On other tasks, performance gets better with size but then worse again, for example Quote-repetition (asks models to repeat back sentences given in the prompt, with few-shot examples to help them recognize the task). A more recent paper (Inverse Scaling: When Bigger Isn't Better, Jun 2023) adds more tasks to this list.


The higher the model layer, the more complex the job of its neurons (added 5/22/23)

The May 2023 OpenAI paper “Language models can explain neurons in language models” uses single-neuron activations to generate explanations for which tokens seem to light up a particular neuron. In lower layers, this can explain 40% of neurons’ behavior: you can find neurons that light up for stuff like “Marvel movies”. In middle layers (20-30), it already gets more complicated: here you have neurons with contextual behavior, like the “hypothetical had” neuron (which activates for “had”, but only in the context of hypotheticals). Finally, later layers can only be explained if you start taking the activations of combinations of neurons, and also the next predicted token, into account: suggesting that those layers do the aggregation and the “decision-making” for what’s going to happen next. This is consistent with the “Identifying Causal Mechanisms in Alpaca” paper, which fits causal models onto LLM behavior, and which also shows that the later layers make the ultimate decision on what to output when the LLM encodes some discrete algorithm.