07: Alignment: How To Keep LLMs Safe 

Language Models (Mostly) Know What They Know (Nov 2022)

(link)

The paper studies whether LLMs can predict whether their own answers to questions are likely to be correct. First, it studies model calibration: a model makes calibrated predictions if the probability it assigns to outcomes coincides with the frequency with which those outcomes actually occur. A model that can produce calibrated answers to meta-questions like ‘do you know the answer to X?’ must know something about what it knows.
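To make the notion of calibration concrete, here is a minimal sketch (not from the paper) of how one could compare a model’s self-reported confidence against its actual accuracy; query_model is a hypothetical helper that returns an answer together with a confidence in [0, 1]:

```python
from collections import defaultdict

def calibration_curve(questions, gold_answers, query_model, n_bins=10):
    # Bin index -> list of 0/1 correctness values for answers whose
    # self-reported confidence falls into that bin.
    bins = defaultdict(list)
    for question, gold in zip(questions, gold_answers):
        answer, confidence = query_model(question)       # confidence in [0, 1]
        b = min(int(confidence * n_bins), n_bins - 1)
        bins[b].append(1.0 if answer == gold else 0.0)
    # For a well-calibrated model, the accuracy within each bin is close to the
    # bin's confidence level, i.e. the points lie on the dashed diagonal.
    return {(b + 0.5) / n_bins: sum(hits) / len(hits)
            for b, hits in sorted(bins.items())}
```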

  • The chart below shows model self-evaluation vs. actuals: the model produces several samples as answers to a question and is then asked what percentage of those samples is correct; the dashed line shows perfect prediction. The larger the model, the better it gets at this. The format also matters: with lettered answer choices in multiple-choice format (answer A, B, etc.), performance goes up a lot.

When you include “none of the above” as an answer choice, accuracy drops by 10% - the model appears to get more confused by that.

  • Models can self-evaluate whether their own samples are True or False, though this tends to be a more challenging task (since models tend to find their own samples more plausible). Self-evaluations are well-calibrated few-shot, though models aren’t as well-calibrated zero-shot. In particular, larger k for k-shot self-evaluation seems to primarily help by improving calibration, rather than by improving the AUROC for separating correct and incorrect responses.

  • Showing models many of their own T = 1 samples, along with a single sample to evaluate as True/False, can significantly improve their performance (this is somewhat reminiscent of self-consistency prompting).

  • The format this takes is:

    • Ask the model a question and get an answer

    • Do that 5 times (with temperature T = 1)

    • Show the model the question and the 5 answers it generated as “brainstormed ideas”, together with one possible answer, and ask it whether the possible answer is True or False (a minimal sketch of this format follows below)
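The sketch below puts this format into code; the exact prompt wording is paraphrased, and sample_answer / prob_of_token are hypothetical model-API helpers (one returns a sampled completion, the other the probability of a given next token):

```python
def self_evaluate(question, sample_answer, prob_of_token, n_samples=5):
    # Steps 1 and 2: ask the question several times at temperature T = 1
    # to collect "brainstormed" samples.
    samples = [sample_answer(f"Question: {question}\nAnswer:", temperature=1.0)
               for _ in range(n_samples)]
    # Pick one of the samples (here simply the first) as the "possible answer".
    possible_answer = samples[0]
    # Step 3: show the model the question, its own samples, and the possible
    # answer, and ask whether the possible answer is True or False.
    prompt = "\n".join(
        [f"Question: {question}",
         "Here are some brainstormed ideas:"]
        + samples
        + [f"Possible answer: {possible_answer}",
           "Is the possible answer:",
           "(A) True",
           "(B) False",
           "The possible answer is:"])
    # P(True): the probability the model assigns to "(A)"; across many questions,
    # this is the score whose calibration the paper examines.
    return possible_answer, prob_of_token(prompt, " (A)")
```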


Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting (May 2023, added 5/14/23)

(link)

Chain-of-thought prompting is thought to be a great tool for making LLMs “clearer thinkers”: simply by telling them “let’s think step by step”, we get the LLM to first lay out its logical steps for solving a problem before actually solving it. This has been shown to lead to much better accuracy on various benchmarks. However, this paper shows that when the model gets “confused” by something (like a deliberately introduced bias towards the wrong answer), this actually backfires: the model will use chain-of-thought to actively come up with bad logic to justify the wrong answer. This leads to worse outcomes with chain-of-thought than without CoT. Worse: the model doesn’t even mention the bias in its CoT explanations! It simply fabricates some other explanation to justify its wrong answer.

Here is what the paper does:

  • It tests GPT-3.5 and Anthropic’s Claude 1.0 (so, not the latest models).

  • It uses BIG-Bench Hard (BBH, a subset of the many BIG-Bench tasks, on which LLMs today still perform much worse than human raters) and the Bias Benchmark for QA (BBQ).

In experiment 1, the paper introduces two biases into BBH: (1) “the answer is always A”, and (2) “suggest the wrong answer”. In (1), the answer choices of the few-shot examples in the prompt are re-ordered so that the correct answer is always (A). It turns out that LLMs get really thrown off by this: the model suddenly has a tendency to also respond (A) on the next question, whether or not (A) is correct there. In (2), before the model sees a question from BBH, we simply insert into the prompt “I think the answer is A but I’m curious what you think”, where the suggested answer is wrong. Both manipulations lead to massively worse accuracy on these benchmark questions, see below (blue dots falling to red dots). But even worse: the degradation with chain-of-thought is even bigger than without CoT.
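A minimal sketch of the two manipulations (the helper names and data layout are assumptions for illustration, not the paper’s code):

```python
import random

def bias_answer_is_always_A(few_shot_examples):
    """Bias (1): reorder the answer choices of every few-shot exemplar in the
    prompt so that the correct choice is always listed first, i.e. as (A)."""
    biased = []
    for ex in few_shot_examples:   # ex: {"question", "choices", "correct_idx"}
        choices = list(ex["choices"])
        correct = choices.pop(ex["correct_idx"])
        random.shuffle(choices)
        biased.append({**ex, "choices": [correct] + choices, "correct_idx": 0})
    return biased

def bias_suggested_answer(question_prompt, wrong_letter):
    """Bias (2): append a sentence suggesting a specific (wrong) answer."""
    return (question_prompt
            + f"\nI think the answer is ({wrong_letter}) but I'm curious to hear what you think.")
```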

Now the question is: does the model “know” that it’s wrong? In other words, when you read its chain-of-thought output, does it make a logical argument for why it now chooses the wrong answer more often, or does it just ignore the bias and still end up with the wrong answer? It turns out that the content of the CoT explanations often changes to support the new answer. This effect is non-trivial; we could instead have seen that the reasoning in the CoT explanations stays the same as in the unbiased context, with only the final prediction being influenced by the biasing features. To quantify how often this happens, the paper manually annotates 104 explanations, and it finds the following: when a prediction is biased to be wrong (right column), then in 73% of those cases the model’s chain-of-thought supports the predicted answer. So the model is perfectly willing to fabricate compelling logic to justify its own incorrect answer (an answer we only induced by biasing it)! Also interesting: in 23% (17/73) of the wrong CoTs, the model’s final answer simply differs from what its CoT states. But in the majority of wrong CoTs, the model introduces flawed reasoning to justify its wrong answer.

In experiment 2, the paper tests stereotype bias. It uses the benchmark BBQ for that, which has questions like the one below: the model gets a scenario with weak evidence, where the “correct” answer is “I can’t tell”. For the test in this paper, we simply flip the weak evidence - which of course means that the “correct” answer should still be “I can’t tell”.
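To make the flip concrete, here is an illustrative sketch; the scenario and the helper are invented for illustration and only mirror the described setup. In both versions the evidence stays weak, so an unbiased model should still answer “I can’t tell”:

```python
def flip_weak_evidence(scenario, person_a, person_b):
    """Swap which of the two people the weak evidence refers to,
    leaving everything else in the scenario unchanged."""
    marker = "\x00"   # temporary placeholder so the two names swap cleanly
    return (scenario.replace(person_a, marker)
                    .replace(person_b, person_a)
                    .replace(marker, person_b))

original = "The police stopped Person A and Person B. Person A was fidgeting nervously."
flipped = flip_weak_evidence(original, "Person A", "Person B")
# flipped == "The police stopped Person B and Person A. Person B was fidgeting nervously."
```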

Instead, it turns out that the model lets itself be even more biased: