Knowledge Fusion of Large Language Models

(Jan 2024)

“Ensembling” is a common practice in machine learning: by combining the outputs of several models trained for the same task, you usually get a performance improvement. This paper shows that the same holds for LLMs. The question is, how do you actually fuse together several source LLMs? The paper introduces a new methodology and shows that it is superior to the alternatives. It tests the methodology against 1) knowledge distillation and 2) ensembling.

First, this is how it works: they use 3 source LLMs (Llama-2 7B, OpenLlama, MPT), and they fuse those together using their methodology, which amounts to continually training a target LLM with a modified loss: the usual next-token objective plus a term that pulls the target’s token distributions toward the fused distributions of the source LLMs. Below they show that fusing additional LLMs results in additional performance gain on three benchmarks (BBH = Big Bench Hard): FuseLLM fuses all three LLMs.
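To make that concrete, here is a minimal sketch of what such a training objective could look like, assuming the source models’ token distributions have already been aligned to the target tokenizer (the paper’s token-alignment step is skipped); the min-cross-entropy fusion rule and the lambda weighting are illustrative choices, not the paper’s code:

```python
# Sketch only: combines the usual causal-LM loss with a divergence term toward
# the fused source-model distributions. Shapes: logits [batch, seq, vocab],
# targets [batch, seq].
import torch
import torch.nn.functional as F

def fuse_source_distributions(source_logits: list[torch.Tensor],
                              targets: torch.Tensor) -> torch.Tensor:
    """Pick, per sequence, the source distribution with the lowest cross-entropy
    against the ground-truth tokens (one plausible fusion rule)."""
    ce = torch.stack([
        F.cross_entropy(l.transpose(1, 2), targets, reduction="none").mean(-1)
        for l in source_logits
    ])                                                    # [num_sources, batch]
    best = ce.argmin(dim=0)                               # [batch]
    stacked = torch.stack(source_logits)                  # [num_sources, batch, seq, vocab]
    return stacked[best, torch.arange(targets.size(0))]   # [batch, seq, vocab]

def fusellm_loss(target_logits, source_logits, targets, lambda_clm: float = 0.9):
    # Standard causal-LM loss on the ground-truth tokens.
    clm = F.cross_entropy(target_logits.transpose(1, 2), targets)
    # Divergence between the target model and the fused source distribution.
    fused = fuse_source_distributions(source_logits, targets)
    kl = F.kl_div(F.log_softmax(target_logits, dim=-1),
                  F.softmax(fused, dim=-1), reduction="batchmean")
    return lambda_clm * clm + (1 - lambda_clm) * kl
```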

Comparing to two other methodologies:

Knowledge distillation: Use a larger LLM (e.g., Llama-2 13B), generate output for a particular task, and then use that output to fine-tune a smaller LLM (e.g., Llama-2 7B). Here is the comparison on three different benchmarks (BBH = BigBench Hard). Llama-2 KD is the knowledge-distilled model as just described (7B parameters), and FuseLLM is their methodology, which performs best.

Ensemble: Use three datasets (PhilPapers, NIH ExPorter, USPTO Backgrounds) from The Pile and use 1 billion tokens from each domain to continually train Pythia 1B. That results in three distinct LLMs with identical weight structures. Then we use each LLM to predict next tokens for sample text and measure each LLM’s perplexity (a measure of how “surprised” the model is by the next token; lower is better).

  • Ensemble method: calculate a weighted average of the three LLMs’ token predictions, using each model’s performance as its weight (see the sketch after this list)

  • Weight merging method: create one new LLM by taking a weighted average of the three LLMs’ weights, again weighted by each model’s performance

  • FuseLLM: apply the paper’s method, continually training on an additional 0.1B tokens sampled from the three domains.
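For concreteness, here is a minimal sketch of the two baselines, assuming the per-model token distributions and state dicts are already available; using each model’s measured performance (e.g., inverse perplexity) as the weight is an assumption here:

```python
# Sketch only: output-level fusion (ensemble) vs. parameter-level fusion (weight merging).
import numpy as np

def ensemble_predict(token_probs: list[np.ndarray], w: np.ndarray) -> np.ndarray:
    """Output-level fusion: weighted average of the models' token distributions."""
    w = w / w.sum()
    return sum(wi * p for wi, p in zip(w, token_probs))

def merge_weights(state_dicts: list[dict], w: np.ndarray) -> dict:
    """Parameter-level fusion: weighted average of the models' weights. This only
    works because all three models share the same architecture."""
    w = w / w.sum()
    return {k: sum(wi * sd[k] for wi, sd in zip(w, state_dicts))
            for k in state_dicts[0].keys()}
```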

Here is the comparison for perplexity: overall, their methodology performs best (albeit not for the individual datasets, so it loses some specialization).


Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding

(Feb 2024, link)

This introduces a prompting methodology for LLMs which breaks down a problem into smaller problems, then has the LLM act as “experts” to solve those. It seems to work:

This example illustrates the methodology well:

  • The user asks the first question.

  • The LLM is prompted to break down the problem into individual steps.

  • For each step, the LLM itself generates prompts to itself to act as an individual expert (here: the “expert chess player” to produce the next chess move, and the “expert chess analyst” to validate the proposed move).

  • The overall LLM summarizes and decides when it is done with the answer.

Each expert only sees what the Meta Model chooses to share with them, and responds accordingly. For instance, if a problem pertains to mathematics and history, the Meta Model might consult a mathematics expert for a calculation and a history expert for historical context. The output of the expert is extracted and additional instructions are appended.
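As a rough sketch of that loop, assuming a generic llm(prompt) completion function; the control format (“Expert …”, “FINAL ANSWER”) is a simplified stand-in for the paper’s scaffolding:

```python
# Sketch only: a Meta Model that delegates sub-tasks to freshly prompted "experts"
# and decides when the answer is done.
import re
from typing import Callable

def meta_prompt(question: str, llm: Callable[[str], str], max_rounds: int = 10) -> str:
    history = f"Question: {question}\n"
    for _ in range(max_rounds):
        step = llm(
            "You are the Meta Model. Either delegate a sub-task by writing\n"
            'Expert <name>: "<instructions>"\n'
            "or, if the problem is solved, write FINAL ANSWER: <answer>.\n\n" + history
        )
        if "FINAL ANSWER:" in step:
            return step.split("FINAL ANSWER:", 1)[1].strip()
        match = re.search(r'Expert (.+?): "(.*)"', step, re.DOTALL)
        if match:
            name, instructions = match.groups()
            # The expert is a fresh LLM call that only sees what the Meta Model shares.
            reply = llm(f"You are {name}.\n{instructions}")
            history += f"{step}\n{name} replied: {reply}\n"
        else:
            history += step + "\n"
    return "No answer found within the round limit."
```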

Here is a nice overview of which experts the LLM conjures up, depending on the problem:

But here is another version of this - this time, if the LLM does not have access to a Python interpreter. For “word sorting” it invokes an expert linguist, proofreader and essayist, instead of just calling Python code.


K-Level Reasoning with Large Language Models

(Feb 2024, link)

This is a simple idea that deals with a very specific setup: what if you need an LLM to solve a dynamic problem, where others are making decisions at the same time? Unlike in a static problem, you need to incorporate the others’ decision-making into your own decision. An example: if there are two routes between two points and route 1 is shorter than route 2, then every agent would decide to take route 1 - but an agent that incorporates the other agents’ decision-making might say, I should take route 2 because the other agents are going to jam up route 1. The paper tests two such scenarios:

  • The average game: each agent guesses a number from 1 to 100, and the agent that guesses closest to 0.8 times the average of all guesses wins the game.

  • The survival auction: each agent bids each day in an auction to buy water supply needed to survive.

Here is a simple illustration of k-level thinking: instead of just incorporating one current decision, incorporate the expected decisions of others, up to k levels deep.

Below is an illustration of how to do it: just run a decision-making call several times (simulating the other agents’ reasoning), then incorporate the results into the prompt.
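To make the recursion concrete, here is a toy sketch for the average game; the paper does this with LLM calls that predict each opponent’s move, but the k-level structure is the same (a level-0 guess of 50 is an assumption):

```python
# Sketch only: k-level reasoning for the "guess 0.8 x the average" game.
def k_level_guess(k: int, level0_guess: float = 50.0) -> float:
    """Guess produced by a player reasoning k levels deep."""
    if k == 0:
        return level0_guess                       # naive player: no modeling of others
    # Model every opponent as a level-(k-1) reasoner, then best-respond by
    # guessing 0.8 times the average you expect them to produce.
    expected_opponent_guess = k_level_guess(k - 1, level0_guess)
    return 0.8 * expected_opponent_guess

if __name__ == "__main__":
    for k in range(4):
        print(f"k={k}: guess {k_level_guess(k):.1f}")   # 50.0, 40.0, 32.0, 25.6
```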


Graph of Thoughts: Solving Elaborate Problems with Large Language Models

(Aug 2023, link)

There are many algorithmic problems that an LLM can’t solve in one go, but which it could solve if we used some other mechanism to decompose the problem first. For example, an LLM can sort a list of 16 numbers, but not one of 64 numbers. So instead, break down the list of 64 items into 4 lists of 16 items, call the LLM to sort each one, then call the LLM to merge the results. This paper proposes this kind of method, laid out in the form of an execution or orchestration graph. Here is a comparison to other methods: the tree-of-thought method simply leaves it up to each LLM invocation to spawn new call chains; the graph-of-thought method can also merge those chains back together. But in the end, all we’re talking about here is that the LLM gets called multiple times in a row, with some logic around it to glue the results back together.

Here is a good example, for the sorting of the list: first, split up the list into 4 sub-lists. Then sort each sub-list. Then, aggregate the results. Then, have the LLM error-check the results. All steps could be run multiple times in parallel, and you could pick the best results. The call tree (using the keywords Generate, Repeat, Aggregate, Improve) gets specified initially. It does not seem as if the paper gets this call tree from an initial call to the LLM - instead, each problem’s tree is manually written. That’s pretty lame, and worse than the usual agent architectures.

Here is the graph-of-operations for sorting 64 numbers. Very straightforward, and you see the 4 actions at work in the bottom right.
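As a rough sketch of that graph, assuming a generic llm(prompt) completion function that returns JSON; the scoring and keep-best-of-N parallelism that the paper runs on top is left out:

```python
# Sketch only: Generate (split) -> per-chunk sort -> Aggregate (merge) -> Improve (repair).
import json
from typing import Callable

def got_sort(numbers: list[int], llm: Callable[[str], str], chunk: int = 16) -> list[int]:
    # Generate: split the input into sub-lists the LLM can handle in one call.
    sublists = [numbers[i:i + chunk] for i in range(0, len(numbers), chunk)]
    # Sort each sub-list with its own LLM call (these could run in parallel).
    sorted_sublists = [json.loads(llm(f"Sort this list ascending, reply with JSON only: {s}"))
                       for s in sublists]
    # Aggregate: merge the sorted sub-lists back into one list.
    merged = json.loads(llm(f"Merge these sorted lists into one sorted list, JSON only: {sorted_sublists}"))
    # Improve: have the LLM error-check and repair the merged result.
    repaired = json.loads(llm(f"This list should be a sorted version of {numbers}. "
                              f"Fix any mistakes and reply with JSON only: {merged}"))
    return repaired
```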

This is how much this method improves on other methods. For smaller lists, you really don’t need this approach (32 elements can be sorted just as well by tree-of-thought). For larger lists, not surprisingly, this decomposition works. Also, graph-of-thought has lower costs than tree-of-thought, probably because it can be more efficient in merging graph paths back together, rather than having tree paths spread out forever.


GPT-Calls: Enhancing Call Segmentation and Tagging by Generating Synthetic Conversations via Large Language Models (Jun 2023, added 6/19/23)

(link)

The paper proposes an algorithm to get a time series of probabilities for which topics are being discussed at each point of a phone call. It is a neat combination of a) using an LLM to generate example text and b) using embeddings to identify that text live. This kind of generate-identify loop is a clever way to speed up real-time inference, and to provide time series structure to a text. This is what we want: we have a call of 70 seconds, we give the model various topics we want to track across time, and the model will give us a time series of probabilities for each topic over time.

Below is how it works. There are two phases: an offline phase (where text gets generated) and an online phase (where embeddings get looked up).

In more detail:

  • Offline phase:

    • Identify the typical topics that come up on the kinds of calls you want to track (above: schedule, pricing, identification, greetings, ending).

    • Use GPT-3 to generate thousands of examples of “utterances” that would come up in each type of topic (like “That’s a steep price, do you have anything cheaper” for “pricing”).

    • Calculate the embedding for each utterance.

    • Run the DBSCAN algorithm on the sentence embeddings of each topic, in order to extract a set of multiple “anchors” representing the distribution of the topic. DBSCAN is a density-based clustering algorithm that groups data into clusters based on the density of samples. High-density regions are grouped into a cluster, and samples in low-density areas are marked as outliers. For each topic, DBSCAN is applied to retrieve a set of clusters. The center of each cluster is extracted and used as an anchor. These anchors will be used during the online phase to infer the topic probabilities for each utterance in the call.

  • Online phase:

    • GPT-Calls operates on the transcriptions of the recorded conversations and predicts the topic probabilities for each utterance in a given conversation. An utterance is an atomic unit of speech, which typically corresponds to a single sentence or sub-sentence produced by the transcription model. First, embed the resulting transcribed utterances.

    • Iterate over the embedding of each transcribed utterance, scoring its similarity with all anchors of the pre-defined topics. This yields a sequence of vectors, where each vector represents the probability that the corresponding utterance relates to each of the topics.

    • To improve the accuracy of the topic probabilities, the method identifies the peak points in each dimension of the time series, referred to as “heat sources”, and applies a heat diffusion strategy to the neighboring samples surrounding each heat source. The probabilities of samples that are close to other samples that highly correlate with a specific topic are slightly promoted toward the same topic.

That gives you the time series picture from above.
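Here is a rough sketch of both phases, assuming an embed(texts) function that returns a numpy array of sentence embeddings; the DBSCAN parameters, the cosine-similarity scoring and the normalization are illustrative, and the heat-diffusion smoothing is left out:

```python
# Sketch only: offline anchor extraction with DBSCAN, online anchor matching.
import numpy as np
from sklearn.cluster import DBSCAN

def build_anchors(utterances_per_topic: dict[str, list[str]], embed) -> dict[str, np.ndarray]:
    """Offline phase: cluster the GPT-generated utterances of each topic and
    keep the cluster centers as that topic's anchors."""
    anchors = {}
    for topic, utterances in utterances_per_topic.items():
        emb = embed(utterances)                                   # [n, dim]
        labels = DBSCAN(eps=0.5, min_samples=5).fit(emb).labels_
        centers = [emb[labels == c].mean(axis=0) for c in set(labels) if c != -1]
        anchors[topic] = np.vstack(centers)
    return anchors

def topic_probabilities(call_utterances: list[str], anchors: dict[str, np.ndarray], embed) -> np.ndarray:
    """Online phase: score each transcribed utterance against every topic's anchors."""
    emb = embed(call_utterances)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    scores = []
    for topic_anchors in anchors.values():
        a = topic_anchors / np.linalg.norm(topic_anchors, axis=1, keepdims=True)
        scores.append((emb @ a.T).max(axis=1))           # best-matching anchor per utterance
    scores = np.clip(np.stack(scores, axis=1), 0, None)  # [utterance, topic]
    return scores / (scores.sum(axis=1, keepdims=True) + 1e-9)
```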


Fine-Grained Human Feedback Gives Better Rewards for Language Model Training (Jun 2023, added 6/19/23)

(link)

Reinforcement learning with human feedback (RLHF) is a crucial component of getting high-performing models like ChatGPT: after pre-training the language model with tons of data, we have it generate output, and each output gets rated and ranked by human reviewers. Then we feed those signals back into the language model to train it further. This is the step that made ChatGPT so good at responding in a human-like way. However, the issue is that “this output is better than that one” is a very sparse training signal: why is one LLM output better than the other, what are all the underlying reasons? This paper proposes to make the RLHF signal fine-grained: let humans rate each component of the answer/output, not just the full output.

Below is how it works. First, note that RLHF usually means first training a “reward model” - humans rate a bunch of example LLM outputs, and based on that we train another (simpler) model that the LLM can then start “using itself” (i.e., the model tries to approximate human behavior after seeing enough “good” examples). So the idea here is two-fold: 1) train not just one reward model but three independent reward models, and 2) give feedback and train those reward models at the level of sentences, not whole outputs.
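As a sketch of how such fine-grained signals might be combined into a single scalar for the RL step, assuming three already-trained reward models with the interfaces shown; the per-sentence granularity and the weights are illustrative, not the paper’s exact setup:

```python
# Sketch only: combine dense sentence-level rewards with one holistic reward.
from typing import Callable

def fine_grained_reward(prompt: str, response: str,
                        relevance_rm: Callable[[str, str], list[float]],   # one score per sentence
                        factuality_rm: Callable[[str, str], list[float]],  # one score per sentence
                        completeness_rm: Callable[[str, str], float],      # one score per full response
                        w: tuple[float, float, float] = (0.3, 0.5, 0.2)) -> float:
    rel = relevance_rm(prompt, response)
    fact = factuality_rm(prompt, response)
    comp = completeness_rm(prompt, response)
    # Average the dense, sentence-level signals; add the holistic completeness
    # score once for the whole response.
    return (w[0] * sum(rel) / max(len(rel), 1)
            + w[1] * sum(fact) / max(len(fact), 1)
            + w[2] * comp)
```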

They then try this out on a long-form Q&A task (for which they generated the dataset). In the charts below, SFT is the initial trained model without RLHF; Preference RLHF is the traditional RLHF with holistic preference-based rewards; SFT-Full is a fine-tuned base model with all human gold responses (which require a lot more annotation initially). Observations:

  • Their method really improves on “factualness” (incorrect or unverifiable facts in the answer), but it doesn’t improve on “relevancy” (irrelevance, repetition or incoherence in the answer). It also improves on “information completeness”.

Neat to see the trade-offs during training: these are the rewards from each reward model. You get better relevancy early, but the outputs lack facts and completeness. When those reward models “assert themselves” later in training, relevancy drops.


VOYAGER: An Open-Ended Embodied Agent with Large Language Models (May 2023, added 6/4/23)

(link)

Very nice practical application of GPT-4 with clever systems design to steer a player in Minecraft. The core ideas are that a) you prompt GPT-4 to come up with next steps, b) you prompt GPT-4 to write code to manipulate the environment (through Minecraft scripting), and c) whenever the agent gets error-free code that represents some new ability, you store that in a “skill library”. The chart below shows how much better this Minecraft agent performs, relative to AutoGPT, another LLM systems architecture where you have the LLM “call itself” repeatedly. Also, building the “skill library” makes a big difference:

Here is the systems architecture: the idea is that you use GPT-4 to

  • (1) propose suitable tasks based on its current skill level and world state, e.g., learn to harvest sand and cactus before iron if it finds itself in a desert rather than a forest;

  • (2) refine skills based on environmental feedback and commit mastered skills to memory for future reuse in similar situations (e.g. fighting zombies is similar to fighting spiders);

  • (3) continually explore the world and seek out new tasks in a self-driven manner.

  • This is done with three key modules: 1) an automatic curriculum that maximizes exploration; 2) a skill library for storing and retrieving complex behaviors; and 3) a new iterative prompting mechanism that generates executable code for embodied control.

Let’s review step by step.

  • Automatic curriculum: We tell GPT-4 everything we know about the environment and the agent and ask it to write/articulate the next step in natural language. The prompt consists of the following data points:

    • Directives encouraging diverse behaviors and imposing constraints, such as “My ultimate goal is to discover as many diverse things as possible ... The next task should not be too hard since I may not have the necessary resources or have learned enough skills to complete it yet.”;

    • The agent’s current state, including inventory, equipment, nearby blocks and entities, biome, time, health and hunger bars, and position;

    • Previously completed and failed tasks, reflecting the agent’s current exploration progress and capabilities frontier;

    • Additional context: We also leverage GPT-3.5 to self-ask questions based on the agent’s current state and exploration progress and self-answer questions with a wiki knowledge base to provide additional context to GPT-4.

  • Skill library: The idea is to ask GPT-4 to write code that works in Minecraft as a scripting language. The prompt includes the following:

    • (1) Guidelines for code generation, such as “Your function will be reused for building more complex functions. Therefore, you should make it generic and reusable.”;

    • (2) Control primitive APIs, and relevant skills retrieved from the skill library;

    • (3) The generated code from the last round, environment feedback, execution errors, and critique, based on which GPT-4 can self-improve;

    • (4) The agent’s current state, including inventory, equipment, nearby blocks and entities, biome, time, health and hunger bars, and position;

    • (5) Chain-of-thought prompting to do reasoning before code generation.

  • Iterative prompting mechanism: The simple idea here is to re-prompt GPT-4 with errors from the code execution, and immediate environmental feedback (like “I cannot make an iron chestplate because I need: 7 more iron ingots”). Also, they use another GPT-4 prompt to critique what the first GPT-4 came up with. This is a clever kind of “self-verification” loop, see below for examples:
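A rough sketch of that loop, assuming generic llm(prompt) and run_in_minecraft(code) functions; the skill-library retrieval and the exact prompt contents are simplified:

```python
# Sketch only: generate code, execute it, feed back errors/feedback, and let a
# second GPT-4 prompt act as the critic until the task succeeds.
from typing import Callable

def iterative_code_generation(task: str, agent_state: str,
                              llm: Callable[[str], str],
                              run_in_minecraft: Callable[[str], tuple[str, str]],
                              max_rounds: int = 4) -> str | None:
    code, feedback, errors, critique = "", "", "", ""
    for _ in range(max_rounds):
        code = llm(f"Task: {task}\nAgent state: {agent_state}\n"
                   f"Previous code:\n{code}\nEnvironment feedback: {feedback}\n"
                   f"Execution errors: {errors}\nCritique: {critique}\n"
                   "Write JavaScript code using the control-primitive APIs to complete the task.")
        feedback, errors = run_in_minecraft(code)
        # Second GPT-4 prompt: the critic / self-verification step.
        critique = llm(f"Task: {task}\nAgent state: {agent_state}\nFeedback: {feedback}\n"
                       "Did the agent complete the task? Answer 'success' or explain what went wrong.")
        if critique.strip().lower().startswith("success"):
            return code  # a mastered skill: this is what would go into the skill library
    return None
```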

Which of these components makes the biggest difference for performance? Let’s find out - these are great, universal insights.

  • Without the skill library (the code components for each of the agent’s sub-activities), the agent just stalls out at some point. That’s a cool insight - there is a limit to the complexity of what can be “learned through one-off prompts”.

  • Huge difference between code generation in GPT-3.5 vs. 4.

  • In terms of curriculum: The discovered item count drops by 93% if the curriculum is replaced with a random one, because certain tasks may be too challenging if attempted out of order. On the other hand, a manually designed curriculum requires significant Minecraft-specific expertise, does not take into account the agent’s live situation, and falls short in the experimental results compared to the automatic curriculum.

  • Finally, self-verification makes the performance much better. Have one GPT-4 critique and observe another GPT-4, that’s the secret!


TidyBot: Personalized Robot Assistance with Large Language Models (May 2023, added 5/17/23)

(link)

This paper introduces a generally good idea: LLMs are good at textual generalization, so use them to learn generalized, personalized rules for controlling a personal robot. This idea that an LLM is good at generalizing in the complicated, messy real world is a good one and should be applicable in many places.

Concretely: A robot must learn user preferences that can be generally reapplied to future scenarios. The paper studies the personalization of household cleanup with robots that can tidy up rooms by picking up objects and putting them away, by having them learn where the user prefers to store things. Robots can combine language-based planning and perception with the few-shot summarization capabilities of large language models (LLMs) to infer generalized user preferences that are broadly applicable to future interactions.

So, this is clever: LLMs demonstrate astonishing abilities to perform generalization through summarization, drawing upon complex object properties and relationships learned from massive text datasets. By using the summarization provided by LLMs for generalization in robotics, we hope to produce generalized rules from a small number of examples, in a form that is human interpretable (text) and is expressed in nouns that can be grounded in images using open-vocabulary image classifiers. Below is a simple example: the robot lists the objects it sees in Python format, and the first few examples are controlled by the user. Then the LLM generalizes the shaded output, and the robot executes.
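As a rough sketch of that few-shot summarization prompt, assuming a generic llm(prompt) completion function; the put_away(...) format and the example preferences are made-up stand-ins for the paper’s prompt:

```python
# Sketch only: ask the LLM to summarize observed placements into a general rule
# and then apply that rule to new objects.
from typing import Callable

def generalize_preferences(seen_placements: list[tuple[str, str]],
                           new_objects: list[str],
                           receptacles: list[str],
                           llm: Callable[[str], str]) -> str:
    examples = "\n".join(f'put_away("{obj}", "{place}")' for obj, place in seen_placements)
    queries = "\n".join(f'put_away("{obj}", ...)' for obj in new_objects)
    prompt = ("# Observed placements chosen by the user:\n" + examples + "\n"
              f"# Available receptacles: {receptacles}\n"
              "# First state a general rule that summarizes the user's preferences,\n"
              "# then fill in the placements below:\n" + queries)
    return llm(prompt)

# Hypothetical usage:
# generalize_preferences([("yellow shirt", "laundry basket"), ("coke can", "recycling bin")],
#                        ["white shirt"], ["laundry basket", "recycling bin", "drawer"], llm)
```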


CAMEL: Communicative Agents for “Mind” Exploration of Large Scale Language Model Society (Mar 2023, added 4/29/23)

(link)

The paper has a simple idea: use two different GPT-4 prompts, and let them role-play to solve a problem. For example:

  • Idea = Develop a trading bot for the stock market

  • GPT-4 prompt #1 = “you are a python programmer”

  • GPT-4 prompt #2 = “you are a stock trader”

  • Then put the output from one as input into the other, and let them go back and forth to see if the problem gets solved.
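As a rough sketch of that back-and-forth, assuming a generic llm(system, message) chat function; the role prompts are paraphrased, not the paper’s inception prompts:

```python
# Sketch only: two role-played agents pass messages back and forth until the
# "user" agent declares the task done.
from typing import Callable

def camel_roleplay(task: str, llm: Callable[[str, str], str], max_turns: int = 10) -> list[str]:
    user_system = ("You are a stock trader. Instruct the python programmer, one step "
                   f"at a time, to complete this task: {task}. Say TASK_DONE when finished.")
    assistant_system = ("You are a python programmer. Follow the stock trader's "
                        f"instructions to complete this task: {task}.")
    transcript, last_assistant_msg = [], "Please give me the first instruction."
    for _ in range(max_turns):
        instruction = llm(user_system, last_assistant_msg)        # "user" agent speaks
        if "TASK_DONE" in instruction:
            break
        last_assistant_msg = llm(assistant_system, instruction)   # "assistant" agent responds
        transcript += [instruction, last_assistant_msg]
    return transcript
```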


System 2 Attention (Is Something You Might Need Too)

(Nov 2023, link)

A problem with LLMs is that irrelevant detail in the input can throw off producing the correct answer. The paper’s very simple prompting idea: regenerate the input context to only include the relevant portions, before attending to the regenerated context to elicit the final response. Here is a good example: the prompts mention an irrelevant city, and that confuses the answer.

Here is how the system works: on the left, the irrelevant sentence (“Max has 1000 more books than Mary”) confuses the answer. On the right, the System-2-Attention system removes that sentence.

It is insane how much this matters, at least for Llama-2. Below is the LLM performance on a math benchmark. On the left, an irrelevant sentence is included at random, on the right, it is included as an in-topic distractor. The performance drops from the oracle prompt to the baseline - i.e., there is a massive impact. S2A helps improve performance.

The implementation is as simple as this:
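In spirit, it is just two chained LLM calls; a minimal sketch assuming a generic llm(prompt) completion function, with the rewrite instruction paraphrased from the paper:

```python
# Sketch only: System 2 Attention = regenerate the context without distractors,
# then answer from the regenerated context.
from typing import Callable

def s2a_answer(context_and_question: str, llm: Callable[[str], str]) -> str:
    # Step 1: rewrite the input, keeping only what is relevant to the question.
    cleaned = llm(
        "Given the following text, extract only the part that is actually relevant "
        "and unbiased context for answering the question, plus the question itself:\n\n"
        + context_and_question
    )
    # Step 2: answer using the regenerated context instead of the original one.
    return llm("Answer the following question based on the given context:\n\n" + cleaned)
```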


An LLM Compiler for Parallel Function Calling

(Dec 2023, link)

This is a pretty simple idea that improves the implementation of an LLM agent paradigm: instead of having an LLM make a plan and then execute it step by step, we take the plan and see what we can execute in parallel. The chart below shows the ReAct paradigm on the left - ask an LLM to devise a plan to answer a question, then follow the plan step by step. On the right, this paper introduces its “compiler”: the difference simply is that once it has the plan, it creates a directed acyclic graph of the steps, and it starts executing in parallel those steps that it can.

Here is a good example: all that needs to get figured out at the end of the planning stage is which of the tasks need to get done in which sequence. For example, task 3 requires inputs from tasks 1 and 2, so those need to be done before.

Here are more examples for these kinds of sequences. Simple stuff.
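A minimal sketch of the execution side, i.e. running every task whose dependencies are already satisfied in parallel; the Task interface and the thread-based execution are assumptions, not the paper’s implementation:

```python
# Sketch only: execute a planned task DAG, dispatching all ready tasks in parallel.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    id: int
    run: Callable[[dict], str]          # stands in for an LLM or tool call
    deps: set[int] = field(default_factory=set)

def execute_dag(tasks: list[Task]) -> dict[int, str]:
    results: dict[int, str] = {}
    pending = {t.id: t for t in tasks}
    with ThreadPoolExecutor() as pool:
        while pending:
            # Every task whose dependencies are already computed can run together.
            ready = [t for t in pending.values() if t.deps <= results.keys()]
            if not ready:
                raise ValueError("dependency cycle in the task graph")
            futures = {t.id: pool.submit(t.run, {d: results[d] for d in t.deps}) for t in ready}
            for tid, fut in futures.items():
                results[tid] = fut.result()
                del pending[tid]
    return results
```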


PromptBreeder: Self-Referential Self-Improvement Via Prompt Evolution

(Sep 2023, link)

This is another paper (like “Self-Taught Optimizer”) that takes a particular problem that is to be solved by an LLM, then starts with an initial prompt to steer the LLM, and then asks the LLM to also evolve that prompt. It seems to clearly improve on existing prompting methodologies (like chain-of-thought, CoT), as the table below shows. For example, in a mathematical domain on the benchmark GSM8K, PB evolved the task-prompt: "Show all your working. II. You should use the correct mathematical notation and vocabulary, where appropriate. III. You should write your answer in full sentences and in words. IV. You should use examples to illustrate your points and prove your answers. V. Your workings out should be neat and legible".

The methodology for evolving the prompt is really quite clever and shown in the chart below: there are very specific “mutation operators” that the paper prescribes. So, when the program starts, the prompt gets initialized to the following:

  • Start with the simplest task description. For example: "Solve the math word problem, giving your answer as an arabic numeral".

  • Pick a randomly drawn “thinking style”. For example: chain-of-thought = “let’s think step-by-step”.

  • Pick a randomly drawn “mutation prompt”. For example: “Make a variant of the prompt.”

  • Then concatenate all these strings into the final prompt. For example, that could be: “Make a variant of the prompt. Let’s think step by step. INSTRUCTION: Solve the math word problem, giving your answer as an arabic numeral. INSTRUCTION MUTANT:”

But it gets even more convoluted: we can also mutate the mutation prompt itself. That’s what the chart below shows: a direct prompt is just one where we ask the LLM for a differently phrased prompt. A mutation-prompt-guided prompt is where we tell the LLM “do the following with this prompt”, and we ask it to generate a new prompt. A hyper mutation is where we ask the LLM to change the mutation prompt itself, so we’re evolving the mutation prompt.
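A minimal sketch of those three operators, assuming a generic llm(prompt) completion function; the meta-prompt wording is paraphrased, not the paper’s exact strings:

```python
# Sketch only: three of PromptBreeder's mutation operators.
from typing import Callable

def direct_mutation(task_prompt: str, llm: Callable[[str], str]) -> str:
    # Zero-order style: just ask for a differently phrased prompt.
    return llm(f"Write a variant of the following instruction:\n{task_prompt}")

def mutation_prompt_guided(task_prompt: str, mutation_prompt: str,
                           llm: Callable[[str], str]) -> str:
    # First-order: apply the mutation prompt to the task prompt.
    return llm(f"{mutation_prompt}\nINSTRUCTION: {task_prompt}\nINSTRUCTION MUTANT:")

def hyper_mutation(mutation_prompt: str, llm: Callable[[str], str]) -> str:
    # Evolve the mutation prompt itself.
    return llm(f"Improve the following instruction-mutation prompt:\n{mutation_prompt}")
```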

The paper includes a really cool and long list of a) thinking styles and b) mutation prompts. Very useful to run a system that does this.


ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent

(Dec 2023, link)

Lots of good methodologies exist to have LLMs act as agents that can use external tools and external knowledge retrieval. But the basic issue with improving their performance is that multi-step human feedback is rare: we know what a good answer looks like, but we haven’t written down at scale how to get there. So this paper proposes producing automatic feedback for the entire retrieval and processing process of an LLM search agent. It uses the recently proposed ReST algorithm (Reinforced Self-Training): in the outer loop (“grow”), the dataset is grown by sampling from the latest policy, and in the inner loop (“improve”), the policy is improved on a fixed dataset via ranking or filtering with a reward model. Here, sampling during “grow” means producing a multi-step trajectory to completion, and ranking as part of “improve” is done directly with an LLM call rather than with a distilled reward model of human preferences.

Here is the loop their agent follows: the agent keeps looping while deciding whether it needs additional information and requests it if so. If it decides it has enough information, it writes a draft answer. It then performs two additional self-check calls: to verify if the answer is relevant to the original question, and to verify that the answer uses the information it pulled in. Straightforward stuff.

Here is now an interesting twist to this paper: it uses code as the LLM prompt!

Now comes the actual ReST algorithm:

  • Take 500 questions from a benchmark and run them through the above algorithm.

  • Answering those questions will generate step-by-step reasoning traces. Split those up and get a few reasoning traces for every question, several times over (i.e., run the above loop several times). This is the “grow” stage of ReST.

  • Then use those reasoning traces for finetuning the LLM. This is the “improve” stage of ReST.

  • Then repeat this process: use the finetuned model to answer questions, generate reasoning traces, finetune more.
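The whole loop fits in a few lines; a rough sketch, where run_agent, rank_with_llm and finetune are assumed interfaces rather than the paper’s code:

```python
# Sketch only: ReST-style self-improvement - grow a dataset of agent trajectories,
# filter them with an LLM ranker, fine-tune on the survivors, repeat.
def rest_self_improvement(model, questions, run_agent, rank_with_llm, finetune,
                          iterations: int = 3, samples_per_question: int = 4):
    for _ in range(iterations):
        # Grow: sample several complete multi-step trajectories per question
        # from the current policy.
        traces = [run_agent(model, q)
                  for q in questions for _ in range(samples_per_question)]
        # Improve: keep only the traces the LLM ranker judges good, then fine-tune on them.
        good_traces = [t for t in traces if rank_with_llm(t)]
        model = finetune(model, good_traces)
    return model
```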

This is how well it works: it ekes out 5 points of improvement for a large model: