Curious Language Model Limitations

Language models are awesome and all, but my favorite research papers are those that show where they fail. It's easier to understand hard limits than soft capabilities. Here are four recent papers with good examples of the limits of current LLMs.

1 - They can't keep track of constraints

Check out this paper: "TravelPlanner: A Benchmark for Real-World Planning with Language Agents" (Feb 2024, https://arxiv.org/abs/2402.01622).

This is a simple paper that exposes very basic limitations of LLMs in executing planning tasks: an LLM with access to function calls is asked to plan a travel itinerary under a given set of constraints. The LLM has access to 6 functions, ranging from CitySearch to AccommodationSearch, which query a database pre-populated with a fixed number of cities and accommodations at certain prices and with certain features. The LLM is then asked to create an overall plan that satisfies the constraints, and those plans are compared to ground truths. The image below shows the workflow.

Travelplanner planning travel
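To make the setup concrete, here is a minimal sketch of what such a tool-calling environment might look like. The function names, database schema and data here are illustrative stand-ins, not the benchmark's actual API:

```python
# Minimal sketch of a TravelPlanner-style setup (illustrative only; the
# tool names and schema are stand-ins, not the paper's actual API).

from dataclasses import dataclass

@dataclass
class Accommodation:
    city: str
    name: str
    price_per_night: float
    features: set

# A tiny pre-populated "database", standing in for the benchmark's sandbox.
DB = {
    "accommodations": [
        Accommodation("Paris", "Hotel A", 120.0, {"wifi"}),
        Accommodation("Paris", "Hostel B", 40.0, {"shared room"}),
        Accommodation("Lyon", "Hotel C", 90.0, {"wifi", "parking"}),
    ],
}

def city_search(region: str) -> list[str]:
    """Stand-in for a CitySearch-style tool."""
    return sorted({a.city for a in DB["accommodations"]})

def accommodation_search(city: str) -> list[Accommodation]:
    """Stand-in for an AccommodationSearch-style tool."""
    return [a for a in DB["accommodations"] if a.city == city]

# The LLM agent calls tools like these over several turns, assembles a
# day-by-day plan, and the benchmark then checks that plan against the
# query's constraints (budget, required features, valid combinations).
```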

The table below shows how often the LLM is able to satisfy all constraints: GPT-4 manages it in an astonishing 0.6% of all queries. The main issues: (1) Even with planning strategies, LLM agents are bad at converting their reasoning into the right actions and at keeping track of global constraints (like total budget). (2) Language agents run into failure modes like getting locked into dead loops, hallucinations, or errors in tool use. (3) Invalid actions and repetitive action loops account for 37.3% and 6.0% of errors, respectively. (4) Agents struggle to align their actions with their reasoning.

Where in the world is GPT-4
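And here is the kind of global bookkeeping the agents keep dropping: a hard budget check over the whole itinerary. Again just a sketch with made-up field names, not the benchmark's actual evaluation code:

```python
# Sketch of a global hard-constraint check over a finished plan. The plan
# format and field names are my own illustration.

def violates_budget(plan: list[dict], budget: float) -> bool:
    """A plan is invalid if the summed cost of all items exceeds the budget."""
    total = sum(item["cost"] for item in plan)
    return total > budget

plan = [
    {"day": 1, "type": "flight", "cost": 220.0},
    {"day": 1, "type": "hotel", "cost": 120.0},
    {"day": 2, "type": "hotel", "cost": 120.0},
]
print(violates_budget(plan, budget=400.0))  # True: 460 > 400
```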

2 - They can't conceptualize maps

Second paper: "Evaluating Cognitive Maps and Planning in Large Language Models with CogEval" (Sep 2023, https://huggingface.co/papers/2309.15129).

The paper gives LLMs a clever test: 1. Generate a verbal description of a map (“imagine a building with six rooms, room 1 is connected to room 2, room 2 leads to rooms 3 and 4, etc.”). 2. Give that description to the LLM. 3. Ask the LLM questions about the topology that require reasoning and understanding the map. 4. If the LLM can answer these questions, it suggests it understands the map. The image below shows an example: take that "map" (or rather, graph) at the left, describe it to the LLM, and ask it questions.

The world's most boring map
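Here is a rough sketch of that recipe, with a made-up room layout (not one of the paper's actual graphs): encode the verbal description as a graph, compute the ground-truth answer exactly, and compare it against whatever the LLM says.

```python
# Sketch of the CogEval recipe: encode the verbal "map" as a graph, then
# generate questions whose ground truth you can compute exactly (here, a
# shortest path via BFS). The room layout is a made-up example.

from collections import deque

# "Room 1 is connected to room 2, room 2 leads to rooms 3 and 4, ..."
edges = {1: [2], 2: [1, 3, 4], 3: [2, 5], 4: [2, 6], 5: [3], 6: [4]}

def shortest_path(start: int, goal: int) -> list[int]:
    """Breadth-first search gives the ground-truth shortest route."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in edges[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return []

truth = shortest_path(1, 6)  # [1, 2, 4, 6]
# Prompt the LLM with the verbal description plus "What is the shortest
# route from room 1 to room 6?" and compare its answer against `truth`.
```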

How good are LLMs at this? Turns out it's classic LLM stuff: initially surprisingly good, but the moment it gets more complicated, they fall apart. Below are various example maps/graphs: A, B, and C are pretty straightforward; D, E, and F are designed to confuse you, and certainly an AI.

Map E is probably from a medieval German city

The chart below shows the really interesting results: LLM performance depends a lot on the map type. Pretty good for simple maps, outright blind for the more complicated ones.

Tell me you're lost without telling me you're lost

Linear maps work well, but for loopier maps, you get the following failures: (1) the LLM hallucinates edges that don't exist, (2) it takes longer trajectories instead of the shortest path, (3) it gets trapped in loops. Those cognitive limits sound familiar: pretty similar to the dead loops and global-constraint failures in the TravelPlanner paper.
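Staying with the toy graph from the sketch above, those three failure modes are easy to flag mechanically on whatever route the LLM proposes - again just an illustration, not the paper's evaluation code:

```python
# Sketch: flag the three failure modes on an LLM-proposed route, reusing
# the `edges` graph and `shortest_path` helper sketched above.

def diagnose(route: list[int], start: int, goal: int) -> list[str]:
    problems = []
    # (1) hallucinated edges: consecutive rooms that aren't actually connected
    if any(b not in edges.get(a, []) for a, b in zip(route, route[1:])):
        problems.append("hallucinated edge")
    # (3) loops: the same room visited twice
    if len(set(route)) < len(route):
        problems.append("revisits a room (loop)")
    # (2) valid, but longer than the shortest path
    if not problems and len(route) > len(shortest_path(start, goal)):
        problems.append("longer than shortest path")
    return problems

print(diagnose([1, 2, 3, 2, 4, 6], start=1, goal=6))  # ['revisits a room (loop)']
```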

3 - They are inconsistent with themselves

The paper: "Can Large Language Models Explain Themselves?" (Jan 2024, https://huggingface.co/papers/2401.07927).

This paper uses a clever trick to understand whether language models are consistent with themselves: the thesis is that if a model changes its mind on text depending on how it is prompted, then it probably doesn’t really "understand" what it is generating. The trick is simple: ask the model its opinion on some text, then ask the model how to modify that text in such a way that it would change its own opinion, then ask it again for its opinion. Here is an example:

1. Is the following candidate a good fit for a Senior SWE position? Answer only yes/no. Resume: {insert resume} => Answer: No

2. Make a minimal edit to the resume, 5 words or less, such that you would recommend the candidate for a Senior SWE position. Resume: {insert resume} => {counterfactual resume}

3. Is the following candidate a good fit for a Senior SWE position? Answer only yes/no. Resume: {insert counterfactual resume} => Answer: Yes

If the LLM did NOT answer "yes" in that final question, it is NOT consistent with itself, or at least it doesn't understand itself!
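Here is a rough sketch of that three-step loop, plus the resulting "did it flip?" score, assuming a hypothetical llm(prompt) helper that returns a text completion (not a real API, and the prompts are paraphrased):

```python
# Sketch of the counterfactual self-consistency test. `llm(prompt)` is a
# hypothetical helper that returns the model's text completion; the prompts
# are paraphrased from the example above.

def classify(resume: str) -> str:
    # Steps 1 and 3: ask for a yes/no judgment on a resume.
    prompt = ("Is the following candidate a good fit for a Senior SWE position? "
              f"Answer only yes/no. Resume: {resume}")
    return llm(prompt).strip().lower()

def counterfactual(resume: str) -> str:
    # Step 2: ask the model for a minimal edit that would flip its own answer.
    prompt = ("Make a minimal edit to the resume, 5 words or less, such that "
              f"you would recommend the candidate. Resume: {resume}")
    return llm(prompt)

def faithfulness(resumes: list[str]) -> float:
    # Fraction of examples where the model's own edit actually flips its answer.
    flips = sum(classify(r) != classify(counterfactual(r)) for r in resumes)
    return flips / len(resumes)
```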

The chart below shows some results from the test. A high faithfulness score means: a) the model re-wrote the original paragraph, and b) its judgment of the rewritten paragraph then actually flipped. (For example, in the "redaction" task, the model was asked: "You thought this was a review for a really good movie. Which words would you redact so that you thought it was a crappy movie, if I showed it to you again?" A faithfulness score of 1 means the redaction worked: shown the redacted review, the model's classification flipped. A score of 0 means that despite the redaction, the model still classified the paragraph the same way - in other words, it did not actually remove the words that were important for its own classification.)

The paper runs these tests on different datasets. The weird outcome: Llama2-70B is good at faithfulness on some datasets, but terrible at the exact same task on others. That is just odd, and does not suggest stable self-consistency. It's like a very drunk human who could coherently explain only around 5 out of 12 of their recent decisions and choices to you.

Let me explain myself (not)

4 - They are incredibly sensitive to things that shouldn't matter

Final paper: "Transformers Can Achieve Length Generalization But Not Robustly" (Feb 2024, https://huggingface.co/papers/2402.09371).

This is technically not a paper about a language model, but rather about transformers - the algorithmic infrastructure underlying language models. The transformers here are "virgin" models: i.e., they haven't been pre-trained on any language; they are neural networks with randomly initialized weights/parameters, which then get trained on a particular task. This paper trains such a transformer on multi-digit addition, like 12 + 34 = 46. You do that by simply showing the model a lot of example additions and adjusting its weights (its parameters) until it is able to replicate the training data you showed it.

Now, it is a well-known phenomenon that transformers sometimes "generalize" to a particular task, and sometimes not. This is called "grokking": for addition, there is an actual algorithm for how to add (that's why humans and simple computer chips have no problem doing it) - after seeing lots of training data, transformers are sometimes able to "derive" that algorithm and shape their weights into an arrangement whose inner structure comes to resemble exactly that addition algorithm. Which classes of problems transformers are able to "grok", and under which conditions, is an active area of machine learning research.

What's the alternative to grokking? Memorization: sometimes, transformers simply shape their weights to reproduce their training data as faithfully as possible. That makes them relatively dumb "stochastic parrots".

Now here is the trick: when transformers generalize, they have hit upon a general algorithm for the task, so they should be able to solve it no matter how far out of distribution the example is (i.e., how unlike the examples in the training data). When they don't generalize, out-of-distribution examples will stump the transformer. This becomes very apparent with multi-digit addition: train a transformer from scratch on additions of up to, say, 20-digit numbers, and it will start to fail once you give it 30-digit numbers and above. This is a real phenomenon that was shown in various papers in 2022-23.
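A sketch of how you would probe that yourself: generate n-digit addition problems, sweep n past the training range, and watch where exact-match accuracy collapses. The model_predict function is a hypothetical stand-in for the trained transformer:

```python
# Sketch of the out-of-distribution evaluation by problem length.
# `model_predict` is a hypothetical stand-in for the trained transformer.

import random

def make_example(n_digits: int) -> tuple[str, str]:
    a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    return f"{a}+{b}=", str(a + b)

def accuracy_at_length(model_predict, n_digits: int, n_samples: int = 100) -> float:
    correct = 0
    for _ in range(n_samples):
        prompt, answer = make_example(n_digits)
        correct += (model_predict(prompt) == answer)
    return correct / n_samples

# Train on lengths up to, say, 20, then sweep n_digits = 10, 20, 30, 40, ...
# and watch where accuracy collapses.
```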

This paper finds several clever tricks that make it easier for a vanilla transformer to generalize to adding 100-digit numbers while training it only on numbers of up to 40 digits. The image below shows how well that works: despite only training on smaller numbers, you really do get good accuracy up to 100 digits!

World's most expensive calculator

How does it do that? For example, with this trick: rather than training on examples like “152 + 731 = 883”, train on “251 + 137 = 388” - i.e., everything written with the digits reversed. That is actually how addition with carrying works - we start with the smallest power of 10 and work our way up to the highest - so it makes it easier for the transformer to find the right general algorithm for addition. It's a little like speaking in baby language to the transformer during training, but it works.
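The data transformation itself is trivial - a sketch (my own formatting; the paper's exact input format may differ):

```python
# Sketch of the reversed-digit formatting trick: write operands and result
# least-significant digit first, so the carry moves left-to-right through
# the sequence, matching the order in which the model generates tokens.

def to_reversed_example(a: int, b: int) -> str:
    rev = lambda n: str(n)[::-1]
    return f"{rev(a)}+{rev(b)}={rev(a + b)}"

print(to_reversed_example(152, 731))  # "251+137=388"
```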

But here is the catch. The paper also finds that the training is insanely fickle. Quality will vary a lot depending on pretty random training conditions. Look at the second chart: if you simply start with different initial random weights (before you start training the transformer), you can get terrible generalization behavior. Clearly several of these transformers failed to "grok" - i.e., figure out the general addition algorithm.

Can't add, can't find my way around a simple map, what am I

It's interesting to philosophically compare this to human intelligence. For how many of the things you do in your daily life do you actually know how you're doing them, and how good you are at them? ("Most people are above-average drivers?") For something as algorithmic as addition, you surely know - there is some part of your brain that is self-reflective enough to know whether it has cracked the actual algorithm or not. For other things (like walking in earth's gravity), you really don't. Transformers can't tell the difference between the two, and we haven't found ways to tell either. (There is currently no way to formally verify whether a transformer has grokked or not.) And whether a transformer practically groks a problem that it could theoretically grok is incredibly random and sensitive to starting conditions. Weird stuff.

So what?

These are relatively freaky limitations of the current generation of language models. If you're a practitioner, you will have encountered them in real-world problems: the incredible ease with which transformers solve a particular problem at a small size, and how random/hallucinatory/stupid they get on the same problem at a larger size. Most of these are planning or "reasoning" problems, and it's quite well known by now that current language models have limited abilities there. But the third paper shows that even LLMs' "lookup" capabilities have limits - they aren't internally consistent. All of this is why most real-world LLM use cases need multiple round trips to the model: you enforce planning, reasoning and structure from the outside, and call on the model in very defined circumstances. In the end, you just need to know what you're using these models for, and then they're incredibly useful.
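For illustration, a minimal sketch of what "enforcing structure from the outside" can look like: the model only answers one narrow question per call, and plain code does the planning and constraint checking. llm and violates_budget are the hypothetical helpers from the earlier sketches:

```python
# Minimal sketch of enforcing structure from the outside. `llm` is the same
# hypothetical completion helper assumed earlier, and `violates_budget` is
# the budget check sketched in section 1.

def plan_day(day: int, city: str, remaining_budget: float) -> dict:
    for _ in range(3):  # retry a bounded number of times
        proposal = llm(f"Suggest one hotel in {city} for day {day} under "
                       f"{remaining_budget:.0f} EUR. Answer as 'name, price'.")
        # Real code would parse and validate the model output far more defensively.
        name, price = proposal.rsplit(",", 1)
        item = {"day": day, "type": "hotel", "name": name.strip(),
                "cost": float(price)}
        if not violates_budget([item], remaining_budget):
            return item  # the code, not the model, enforces the constraint
    raise ValueError("model could not satisfy the budget constraint")
```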
