Toolformer: Language Models Can Teach Themselves to Use Tools (Feb 2023)

(link)

There are several prior approaches that augment language models with additional textual information during pre-training, but in those cases the additional information is always provided up front. The idea of this paper is to instead fine-tune the LLM so that it learns when to ask for additional information during token generation.

LLMs would perform better if they knew precisely when to call external APIs. The paper proposes a simple way to generically train an LLM to do that. Here is the algorithm:

  • Take a number of training examples - short texts that would benefit from an API call to get something right. For example: “Out of 1400 participants, 400 (or 29%) passed the test.”

  • This is the kind of thing LLMs often screw up, because they can't do math reliably. So teach the model to annotate such a training example by inserting an API call: “Out of 1400 participants, 400 (or [Calculator(400/1400)→ 0.29] 29%) passed the test.”

  • The way to do that is to first pick a few APIs. Each of them returns just text based on a well-defined text input. For example:

    • [QA(“Who is the publisher of The New England Journal of Medicine?”) → Massachusetts Medical Society]

    • [MT(“tortuga”) → turtle]

    • [WikiSearch(“Brown Act”) → The Ralph M. Brown Act is an act of the California State Legislature that guarantees the public's right to attend and participate in meetings of local legislative bodies.]

  • Then we pick a few training examples, each a short text, and we prompt the model to insert API calls into that text. We do that simply by writing a prompt that shows the model the annotation format:

  • Now feed each example text x through that prompt, and the model should reproduce the same text, but with a few locations annotated with an API call. The paper then also looks at the model's confidence for starting an API call at each position, and it only keeps the top k positions that are above a certain confidence threshold (so you don't get the LLM throwing API calls around all over the place).

  • If you feed 10 example texts through the prompt above, you get 10 annotated examples, each with API calls the model inserted. However, now we do further filtering.

    • Take the example prefix x_{1:i-1} = “Pittsburgh is also known as”. Say the LLM responds to the annotation prompt above by coming back with two proposed API calls:

      • (1) c_i^1 = [QA(What other name is Pittsburgh known by?)]

      • (2) c_i^2 = [QA(Which country is Pittsburgh in?)]

    • Now execute each of these calls.

    • Now re-run the following examples through the LLM:

      • (a) “Pittsburgh is also known as”

      • (b) “[QA(What other name is Pittsburgh known by?) → Steel City] Pittsburgh is also known as”

      • (c) “[QA(Which country is Pittsburgh in?) → United States] Pittsburgh is also known as”

    • Note how (a) is just the original prompt, while (b) and (c) prefix the API call and its response to the original prompt. For each variant, calculate the weighted cross-entropy loss of the rest of the original text. Keep an API call only if prefixing it together with its result lowers that loss by at least a threshold compared to not calling the API at all (a code sketch of this filtering step follows the list). Here, the kept annotation would be:

      • “Pittsburgh is also known as [QA(What other name is Pittsburgh known by?) → Steel City]”, followed by the rest of the original text

  • Those annotated texts now become the source material for fine-tuning the LLM!

  • By fine-tuning the LLM on them, we teach it to generate API-call tokens automatically as part of its normal token prediction. At inference time, we just watch each newly generated token; whenever the model emits an API call, we pause generation, execute the call, insert the response into the model's output, and continue generating from there.
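
Here is a minimal sketch of that loss-based filtering step, assuming a HuggingFace causal LM. This is my own illustration, not the paper's code: the helper names are made up, the example continuation is assumed, and the paper additionally weights the loss by distance from the insertion point, which is omitted here.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def continuation_loss(prefix: str, continuation: str) -> float:
    """Cross-entropy of `continuation` given `prefix` (lower is better)."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, cont_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100  # score only the continuation tokens
    with torch.no_grad():
        return model(input_ids=input_ids, labels=labels).loss.item()

def keep_api_call(prefix: str, continuation: str, call: str, result: str,
                  tau: float = 1.0) -> bool:
    """Keep the call if prefixing it (with its result) lowers the loss on the
    continuation by at least `tau` compared to the best alternative."""
    baseline = min(
        continuation_loss(prefix, continuation),                # (a) no API call
        continuation_loss(f"[{call}] {prefix}", continuation),  # call without result
    )
    with_result = continuation_loss(f"[{call} → {result}] {prefix}", continuation)  # (b)/(c)
    return baseline - with_result >= tau

# Example (continuation assumed for illustration):
# keep_api_call("Pittsburgh is also known as", "the Steel City.",
#               "QA(What other name is Pittsburgh known by?)", "Steel City")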


Autoformalization with Large Language Models (May 2022)

(link)

Autoformalization is the process of automatically translating from natural language mathematics to formal specifications and proofs.

  • Note: this can be great for our idea of automatically configuring healthcare systems

  • We make the surprising observation that LLMs can correctly translate a significant portion (25.3%) of mathematical competition problems perfectly to formal specifications in Isabelle/HOL

  • Problem: Formal mathematics data is very scarce. For example, one of the largest formal mathematics libraries, the Archive of Formal Proofs, is only 180MB in size, that is less than 0.18% of the training data for the large language model Codex.

The paper does something extremely simple: it gives a k-shot prompt (here, just 2 examples!) for translating a mathematical problem statement written in LaTeX into a formal language (here, Isabelle/HOL), then it asks the model to do the same for other problems.
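
A minimal sketch of that few-shot prompt construction, just to make the mechanics concrete. This is my own illustration, not the paper's prompt: the two worked examples and the complete() helper are assumptions.

def complete(prompt: str) -> str:
    """Placeholder for a call to a code LLM (the paper uses Codex)."""
    raise NotImplementedError

FEW_SHOT = r"""Natural language version: "Show that for any real number $x$, $x^2 \ge 0$."
Isabelle version:
theorem sq_nonneg: fixes x :: real shows "x^2 \<ge> 0"

Natural language version: "Show that the sum of two even integers is even."
Isabelle version:
theorem even_sum: fixes a b :: int assumes "even a" and "even b" shows "even (a + b)"
"""

def autoformalize(problem_latex: str) -> str:
    # Append the new problem in the same format and let the model continue.
    prompt = (FEW_SHOT
              + f'\nNatural language version: "{problem_latex}"\n'
              + "Isabelle version:\n")
    return complete(prompt)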

Improvement opportunities:

  • Expert Iteration: In neural theorem proving, one way to get better quality data is to use feedback from the proof checker to run many proof searches (or generate multiple proofs) and check the proof attempts for correctness. Newly found correct proofs can then be used as the new training data to improve the neural prover.


MathPrompter: Mathematical Reasoning Using Large Language Models (Mar 2023)

(link)

The paper shows a smarter way to instruct LLMs to do mathematical reasoning: by producing both an algebraic expression and Python code that solve a mathematical problem, which can then be cross-checked against each other for consistency.

  1. Take in the original problem: “At a restaurant, each adult meal costs $5 and kids eat free. If a group of 15 people came in and 8 were kids, how much would it cost for the group to eat?”

  2. Replace the numbers in the problem with algebraic variables and generate that text: “Qt: at a restaurant, each adult meal costs A and kids eat free. if a group of B people came in and C were kids, how much would it cost for the group to eat? Mapping: {A:5, B:15, C:8}”

  3. Now give the LLM two prompts to create two different ways to solve this:

    1. Algebraic prompt: Write a mathematical equation and generate the answer format starting with ‘Answer =’

    2. Python prompt: Write a Python function that returns the answer

  4. Now run both answers (the algebraic prompt through the LLM, and the Python prompt through an interpreter) and substitute in the correct numbers

  5. Repeat steps 3-4 five times, compare all the output results, and pick the answer with the most votes (see the sketch after this list).
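
Here is a minimal sketch of that loop. The ask_llm helper, the exact prompt wording, and the solution(A, B, C) function name are my assumptions, not the paper's code.

from collections import Counter

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to the underlying LLM (the paper uses GPT-3 DaVinci)."""
    raise NotImplementedError

TEMPLATE = ("Qt: at a restaurant, each adult meal costs A and kids eat free. "
            "If a group of B people came in and C were kids, how much would it "
            "cost for the group to eat? Mapping: {A:5, B:15, C:8}")

def solve(mapping={"A": 5, "B": 15, "C": 8}, n_runs=5):
    answers = []
    for _ in range(n_runs):
        # Step 3.1: algebraic prompt -> the LLM returns e.g. "Answer = A * (B - C)"
        algebraic = ask_llm(TEMPLATE + "\nWrite a mathematical equation and generate "
                            "the answer format starting with 'Answer ='")
        expr = algebraic.split("Answer =")[-1].strip()
        # Step 3.2: Python prompt -> the LLM returns a function solution(A, B, C)
        code = ask_llm(TEMPLATE + "\nWrite a Python function solution(A, B, C) "
                       "that returns the answer")
        env = {}
        exec(code, env)  # running LLM-generated code; sandbox this in practice
        # Step 4: substitute the real numbers into both solutions and evaluate them
        answers.append(eval(expr, {}, dict(mapping)))
        answers.append(env["solution"](**mapping))
    # Step 5: majority vote over all evaluated results
    return Counter(answers).most_common(1)[0][0]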

This improves DaVinci's performance on the MultiArith dataset.


Self-planning Code Generation with Large Language Model (Mar 2023)

(link)

Really simple idea in this paper: have an LLM create code, but first have it create its own step-by-step plan, and then fill in each step of that plan with code.

Instead of asking this: “Create a function encrypt that takes a string as an argument and returns a string encrypted with the alphabet being rotated. The alphabet should be rotated in a manner such that the letters shift down by two multiplied to two places.”

Ask this instead: “Create a function encrypt that takes a string as an argument and returns a string encrypted with the alphabet being rotated. The alphabet should be rotated in a manner such that the letters shift down by two multiplied to two places. For example:

encrypt(’hi’) returns ’lm’

encrypt(’asdfghjkl’) returns ’ewhjklnop’

encrypt(’gf’) returns ’kj’

encrypt(’et’) returns ’ix’

Let’s think step by step.”

=> Then the model will first create the step-by-step plan, and then it will generate the code:

  1. Create a alphabet, bias two places multiplied by two.

  2. Loop the input, find the latter bias letter in alphabet.

  3. Return result.

[Code comes here]
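
For concreteness, here is a plausible implementation of that three-step plan (my own sketch; the paper's actual generated code is not reproduced in these notes):

def encrypt(s):
    # 1. Create an alphabet, bias two places multiplied by two (i.e. a shift of 4).
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    bias = 2 * 2
    # 2. Loop the input, find the letter `bias` places later in the alphabet.
    result = ""
    for ch in s:
        result += alphabet[(alphabet.index(ch) + bias) % 26]
    # 3. Return result.
    return result

assert encrypt("hi") == "lm"
assert encrypt("asdfghjkl") == "ewhjklnop"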

The paper shows that this leads to better-functioning code.


ReAct: Teaching LLMs how to use tools & coding it (Mar 2023)

(link)

This is based on a paper called ReAct, and the idea is simple: tell the LLM to (1) use chain-of-thought, (2) announce when it needs an outside source (like Wikipedia), and (3) intercept the LLM's output whenever it says it needs to go outside, run that lookup, and feed the result back in. Two ways to do it:

  • Use LangChain

  • Write it yourself directly calling GPT-4

Writing it yourself is incredibly simple. The full code is on the link. Here is the prompt:

“You run in a loop of Thought, Action, PAUSE, Observation.

At the end of the loop you output an Answer

Use Thought to describe your thoughts about the question you have been asked.

Use Action to run one of the actions available to you - then return PAUSE.

Observation will be the result of running those actions.”

Then list the actions in the same prompt:

“Your available actions are:

calculate:

e.g. calculate: 4 * 7 / 3

Runs a calculation and returns the number - uses Python so be sure to use floating point syntax if necessary

wikipedia:

e.g. wikipedia: Django

Returns a summary from searching Wikipedia

simon_blog_search:

e.g. simon_blog_search: Django

Search Simon's blog for that term”

=> That’s all!
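
Here is a minimal sketch of the intercept loop (my own condensed version of the approach at the link; the chat helper and the tool functions are placeholders):

import re

def chat(messages) -> str:
    """Placeholder for a call to the chat model, returning the assistant's reply text."""
    raise NotImplementedError

known_actions = {
    "calculate": lambda expr: str(eval(expr)),  # toy calculator; sandbox in practice
    "wikipedia": lambda q: "...",               # call the Wikipedia search API here
    "simon_blog_search": lambda q: "...",       # call the blog search endpoint here
}

action_re = re.compile(r"^Action: (\w+): (.*)$", re.MULTILINE)

def query(question: str, system_prompt: str, max_turns: int = 5) -> str:
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = chat(messages)
        messages.append({"role": "assistant", "content": reply})
        match = action_re.search(reply)
        if not match:
            return reply                          # no Action requested: this is the Answer
        action, arg = match.groups()
        observation = known_actions[action](arg)  # run the tool ourselves
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return reply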


HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace (Apr 2023)

(link)

Clever paper that proposes a system based entirely on GPT-3.5 that uses the LLM as an orchestrator of all other models on HuggingFace to solve complex problems. Simple:

  • Ask the LLM to devise a plan as to the sub-tasks to break a complex task into

  • Search HuggingFace model descriptions to solve each sub-task, pick the top-k model descriptions for each sub-task, then send all of those to the LLM for it to pick the most appropriate one

  • Use the LLM to generate the API call to the model that solves a particular sub-task

  • Pop all the results back into the LLM and have it generate the overall output

The entire paper uses no fine-tuning and no other modifications to GPT-3.5; it just cleverly uses the natural-language model descriptions on HuggingFace and the LLM's own output. A rough sketch of the four stages follows.
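
This is how I read the four stages in code, as a rough sketch (not the paper's implementation); every helper here is a placeholder.

def ask_llm(prompt: str) -> str: raise NotImplementedError
def parse_tasks(plan: str) -> list: raise NotImplementedError
def search_hf_models(task: str) -> list: raise NotImplementedError   # top-k model cards
def call_hf_inference(model_id: str, args: str): raise NotImplementedError

def hugging_gpt(user_request: str) -> str:
    # 1. Task planning: the LLM decomposes the request into sub-tasks.
    plan = ask_llm(f"Break this request into a list of sub-tasks: {user_request}")
    results = {}
    for task in parse_tasks(plan):
        # 2. Model selection: retrieve the top-k candidate model descriptions,
        #    then let the LLM pick the most appropriate one.
        candidates = search_hf_models(task)
        model_id = ask_llm(f"Pick the best model for '{task}' from: {candidates}")
        # 3. Task execution: the LLM generates the API call, which we then run.
        args = ask_llm(f"Generate the API arguments for {model_id} on: {task}")
        results[task] = call_hf_inference(model_id, args)
    # 4. Response generation: feed all intermediate results back into the LLM.
    return ask_llm(f"Answer '{user_request}' using these results: {results}")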


ART: Automatic multi-step reasoning and tool-use for large language models (Mar 2023)

(link)

When you need to call external tools or models as part of LLM reasoning, you typically have to fine-tune the LLM to get it to generate those calls. The paper presents Automatic Reasoning and Tool use (ART), a framework that automatically generates decompositions (multi-step reasoning) for instances of new tasks. It does not require fine-tuning and instead uses a frozen LLM.

  • Selects and uses the most appropriate available tools (like search engines and code execution) in individual steps

  • Retrieves demonstrations of related tasks from a task library to enable few-shot decomposition and tool use.

  • Demonstrations follow a flexible but structured query language, such that it is easy to parse intermediate steps, stop generation to call external tools, and resume it after including the output of such tools

  • Comparison to other frameworks/papers:


TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs (Mar 2023)

(link)

This is an “LLM-only” version of the Toolformer paper, and thus similar to HuggingGPT: it uses an LLM to connect to lots of external APIs (such as a PowerPoint API, or various vision APIs to understand images) to solve problems in steps. However, the additional clever idea is that it forces the LLM to write API-driven code to solve all problems, including when it has to write an essay. In other words, all problems initially get converted into pseudo-code that includes API calls. To start, it creates an “API platform” where it defines all available API calls. Here is an example for “open a file”:

Here is an example of back-and-forth interaction for writing an essay:

Here is an example of how they define the API platform. All of the following is part of the initial prompt to the LLM.

  • API Documentations. As an assistant for generating and editing slides, you have access to a list of APIs to control PowerPoint with the following functions:

    • create_slide(): This API is used to create a new slide.

    • insert_text(text:str): This API is used to insert text into a text box. The content of each slide should contain several sentences, you can call insert_text multiple times, or split the multiple sentences by ’\n’.

    • select_title(): This API is used to select the text box of the title. You should first select the text box of the title and then insert or delete the text in the text box of the title.

    • select_content(): This API is used to select the text box of the content. You should first select the text box of the content and then insert or delete the text in the text box of the content.

    • move_to_slide(slide_id:int): This API is used to move to a specific slide in the presentation. It can take one parameter, slide_id: the ID of the slide to move to as an integer

  • The current version of PPT is:

    • Page: 1

      • Title: Big Technical Companies

      • Visual Positions:

    • Page: 2

      • Title: Big Technical Companies

      • Contents:

        • Microsoft (1975)

        • Apple (1976)

        • Amazon (1994)

        • Google (1998)

        • Facebook (2004)

      • Visual Positions:

  • The History of our conversation:

    • Human: I hope to create a PPT about big technical companies. Can you create a slide to list some of them?

    • AI: …

  • General prompts:

    • Don’t define new functions.

    • In the output, each line should contain only one function. And each line should end with a ";".

    • Please finish my following instruction with the functions I defined.

  • Human: For each company, let’s create one slide to introduce its founder, location, mission, products, subsidiaries:

    • MCFM: Sure, here’s the code to generate slides for each company: etc. (a plausible sketch of such code follows below)
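
To make the elided response concrete, here is a plausible sketch of the kind of script MCFM might return for the first company, using only the APIs defined in the API platform above (my own illustration; the notes do not reproduce the actual generated code):

create_slide();
select_title();
insert_text("Microsoft (1975)");
select_content();
insert_text("Founder: Bill Gates and Paul Allen\nLocation: Redmond, Washington\nProducts: Windows, Office, Azure\nSubsidiaries: LinkedIn, GitHub");

…and so on, one slide per remaining company.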


Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models (Apr 2023, added 4/29/23)

(link)

This is a similar approach to Toolformer, HuggingGPT and Taskmatrix: use GPT-4 to call other modules, compose the output from those modules, and get an overall result. Chameleon synthesizes programs to compose various tools, including LLM models, off-the-shelf vision models, web search engines, Python functions, and rule-based modules tailored to user interests.

  • Chameleon with GPT-4 achieves an 86.54% accuracy on ScienceQA, significantly improving upon the best published few-shot model by 11.37%; using GPT-4 as the underlying LLM, Chameleon achieves a 17.8% increase over the state-of-the-art model, leading to a 98.78% overall accuracy on TabMWP

  • What they do differently from Toolformer etc. is that they figured out more cleverly how to create modules that can be chained sensibly: e.g., it’s smarter to call a “table explainer” module before sending an entire large table into a prompt that’s supposed to reason about it.

Here is a nice example of how it works:

Here is a nifty comparison table on how this stacks up against these other tools/papers. Seems like the basic idea is exactly the same, but they connected it to more modules/tools.

Here are the modules they wrote (a sketch of how a chain of these modules executes follows the list):

  • Knowledge Retrieval: look something up online

  • Bing Search: obvious

  • Query Generator: create search engine queries based on the given problem, which is then used as inputs to the “Bing Search” module - this is smart serialization

  • Image Captioner: produces caption for an input image

  • Text Detector: identify text in an input image

  • Row Lookup: return a simplified version of a table, retaining only the rows pertinent to the question; the module accepts a question and a table as input and outputs the simplified table

  • Column Lookup: same idea as Row Lookup

  • Table Verbalizer: convert tables into easily comprehensible descriptions for downstream modules like “Program Generator” and “Solution Generator”

  • Program Generator: creates Python code to solve a given problem

  • Program Verifier: ensure the validity and error-free nature of the programs generated by the “Program Generator”

  • Program Executor: executes Python programs

  • Solution Generator: generate a detailed solution to the input query, taking into account all the information stored in the cache, using chain-of-thought

  • Answer Generator: typically the final module in a module chain, extracts and normalizes the answer from the results generated by the “Program Executor” or “Solution Generator” using a rule-based approach
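
Here is a minimal sketch of how such a module chain executes (my own illustration; the planner prompt and the module bodies are placeholders). The planner is itself an LLM call that outputs an ordered list of module names, and each module reads from and writes to a shared cache, which is the cache the “Solution Generator” description above refers to.

def plan_modules(query: str) -> list:
    """LLM planner: returns e.g. ["Query Generator", "Bing Search",
    "Solution Generator", "Answer Generator"]."""
    raise NotImplementedError

MODULES = {
    "Query Generator":    lambda cache: {**cache, "search_query": "..."},
    "Bing Search":        lambda cache: {**cache, "search_results": "..."},
    "Solution Generator": lambda cache: {**cache, "solution": "..."},
    "Answer Generator":   lambda cache: {**cache, "answer": "..."},
}

def run(query: str) -> str:
    cache = {"query": query}
    for name in plan_modules(query):
        cache = MODULES[name](cache)  # each module consumes and extends the cache
    return cache["answer"]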

It performs well. However, it is noteworthy how little it improves over plain GPT-4 without any of this massive machinery around it. Just goes to show how powerful GPT-4 is on its own.

The difference is bigger on TabMWP. But even here, going from plain GPT-4 to GPT-4 PoT (program of thought) gets a big part of the way there. “Program of thought” simply means prompting GPT-4 to write an executable program as the answer to a question, rather than generating the entire answer itself.

Finally, this is a nifty table of how complex these answer chains get. See how GPT-4 Program of Thought always creates one program as its answer, but that program is longer than a chain-of-thought answer. Chameleon chains lots of different modules that run sequentially.