10: Memory & Retrieval for LLMs

Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study (Apr 2023)

(link)

The paper tackles the question of retrieval: how can external knowledge be incorporated more effectively into an LLM? It studies whether it makes sense to build an actual retrieval mechanism into the training of the model. The architecture is quite simple:

  • Use a GPT-like LLM that is trained on a large pre-training corpus (the same setup as, say, GPT-3.5)

  • Add a retrieval database: this is simply a key-value store in which knowledge can be looked up

  • During training time (not just when you’re prompting the model!), the input given to the model is also run through the retrieval database, and whatever knowledge is extracted is added to the input before running through the LLM as usual for training (a minimal sketch of this step follows the list).

  • The knowledge stored in the retrieval database is simply the same knowledge used to train the LLM itself. (One idea here is that it is far easier to keep the knowledge in that retrieval database up-to-date: you just write new values into the database, because it’s just a database. No re-training needed whatsoever.)
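
To make the setup concrete, here is a minimal, hypothetical sketch of the training-time retrieval step. The lexical-overlap scorer, the `retrieve`/`augment_input` helpers, and the in-memory list are all illustrative stand-ins, not the paper’s actual implementation (a real system uses embedding-based nearest-neighbor search over the pre-training chunks):

```python
# Toy sketch of retrieval-augmented training input (all names hypothetical).

def overlap(query: str, chunk: str) -> float:
    """Toy lexical overlap standing in for dense-vector similarity."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

# The retrieval database: in spirit a key-value store whose contents are the
# same text chunks used to pre-train the LLM. Updating knowledge is just
# writing new rows -- no re-training.
database = [
    "Paris is the capital of France.",
    "Water boils at 100 degrees Celsius at sea level.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k best-matching chunks from the database."""
    return sorted(database, key=lambda chunk: -overlap(query, chunk))[:k]

def augment_input(training_example: str) -> str:
    """Prepend retrieved chunks to the input before the usual forward pass."""
    return "\n".join(retrieve(training_example)) + "\n" + training_example

print(augment_input("What is the capital of France?"))
# -> "Paris is the capital of France.\nWhat is the capital of France?"
```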

This yields some improvement on more knowledge-intensive tasks, in particular question answering, while not degrading performance on other tasks; the gains, however, are not super impressive.


Generative Agents: Interactive Simulacra of Human Behavior (Apr 2023)

(link)

Very cool paper that uses an LLM to independently simulate 25 characters in a sandbox environment, keeping track of their memories, reflections and plans. Here are the most important implementation details:

  • Objects in the environment (like a bed or refrigerator) can be influenced by the agents, or by user input, for example: “<Isabella’s apartment: kitchen: stove> is burning”

  • All agents follow this loop, updated every few minutes in game time:

  • Memory stream: It is a list of memory objects, where each object contains a natural language description, a creation timestamp and a most recent access timestamp. The most basic element of the memory stream is an observation, which is an event directly perceived by an agent. Common observations include behaviors performed by the agent themselves, or behaviors that agents perceive being performed by other agents or non-agent objects. For instance, Isabella Rodriguez, who works at a coffee shop, might accrue the following observations over time: “(1) Isabella Rodriguez is setting out the pastries, (2) Maria Lopez is studying for a Chemistry test while drinking coffee, (3) Isabella Rodriguez and Maria Lopez are conversing about planning a Valentine’s day party at Hobbs Cafe, (4) The refrigerator is empty”

  • Memory retrieval function: Takes the agent’s current situation as input and returns a subset of the memory stream to pass on to the language model. It scores memories on three components (a scoring sketch follows below):

    • Recency assigns a higher score to memory objects that were recently accessed, so that events from a moment ago or this morning are likely to remain in the agent’s attentional sphere. In our implementation, we treat recency as an exponential decay function over the number of sandbox game hours since the memory was last retrieved. Our decay factor is 0.99.

    • Importance distinguishes mundane from core memories, by assigning a higher score to those memory objects that the agent believes to be important. The system simply prompts the LLM: “On the scale of 1 to 10, where 1 is purely mundane (e.g., brushing teeth, making bed) and 10 is extremely poignant (e.g., a break up, college acceptance), rate the likely poignancy of the following piece of memory. Memory: buying groceries at The Willows Market and Pharmacy. Rating: <fill in>”

    • Relevance assigns a higher score to memory objects that are related to the current situation. In our implementation, we use the language model to generate an embedding vector of the text description of each memory. Then, we calculate relevance as the cosine similarity between the memory’s embedding vector and the query memory’s embedding vector.
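
A minimal sketch of how these three components could combine into a single retrieval score. The `MemoryObject` record and helper functions are assumptions; also note the paper min-max normalizes each component across the memory stream before summing, whereas this sketch simply rescales importance to [0, 1]:

```python
# Sketch of the memory stream record and three-part retrieval score.
from dataclasses import dataclass
import math

@dataclass
class MemoryObject:
    description: str
    created_at: float        # game time, in hours
    last_accessed: float     # game time, in hours
    importance: float        # 1-10, rated once by the LLM at creation
    embedding: list[float]   # embedding of the description

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieval_score(mem: MemoryObject, query_embedding: list[float],
                    now: float, decay: float = 0.99) -> float:
    recency = decay ** (now - mem.last_accessed)       # exponential decay per game hour
    importance = mem.importance / 10.0                 # rescale the 1-10 rating
    relevance = cosine(mem.embedding, query_embedding)
    return recency + importance + relevance            # paper weights all three equally

def retrieve(stream: list[MemoryObject], query_embedding: list[float],
             now: float, k: int = 5) -> list[MemoryObject]:
    ranked = sorted(stream, key=lambda m: -retrieval_score(m, query_embedding, now))
    for m in ranked[:k]:
        m.last_accessed = now   # retrieval refreshes recency
    return ranked[:k]
```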

  • Reflection: Agents struggle to draw proper inferences if memory consists only of observations. We introduce a second type of memory, which we call a reflection. Reflections are higher-level, more abstract thoughts generated by the agent. Because they are a type of memory, they are included alongside other observations when retrieval occurs.

    • We query the large language model with the 100 most recent records in the agent’s memory stream (e.g., “Klaus Mueller is reading a book on gentrification”, “Klaus Mueller is conversing with a librarian about his research project”, “desk at the library is currently unoccupied”) and prompt the language model, “Given only the information above, what are 3 most salient high-level questions we can answer about the subjects in the statements?” The model’s response generates candidate questions: for example, “What topic is Klaus Mueller passionate about?”

    • We use these generated questions as queries for retrieval, and gather relevant memories (including other reflections) for each question. Then we prompt the language model to extract insights and cite the particular records that served as evidence for the insights. The full prompt appears below: “Statements about Klaus Mueller: 1. Klaus Mueller is writing a research paper 2. Klaus Mueller enjoys reading a book on gentrification 3. Klaus Mueller is conversing with Ayesha Khan about exercising [...] What 5 high-level insights can you infer from the above statements? (example format: insight (because of 1, 5, 3))”

    • This process generates statements such as “Klaus Mueller is dedicated to his research on gentrification (because of 1, 2, 8, 15)”.

    • Reflection explicitly allows the agents to reflect not only on their observations but also on other reflections: for example, the second statement about Klaus Mueller above is a reflection that Klaus previously had, not an observation from his environment. As a result, agents generate trees of reflections.
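
A rough sketch of the reflection step, assuming a generic `llm()` chat-completion placeholder and a `retrieve()` function like the one above; the prompts follow the paper’s wording, but the response parsing here is deliberately naive:

```python
# Sketch of reflection generation (llm() and parsing are placeholders).
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def generate_reflections(memory_stream, retrieve):
    # 1) Ask for salient high-level questions over the 100 most recent memories.
    recent = "\n".join(m.description for m in memory_stream[-100:])
    questions = llm(
        recent + "\nGiven only the information above, what are 3 most salient "
        "high-level questions we can answer about the subjects in the statements?"
    ).splitlines()

    # 2) For each question, retrieve evidence (observations and prior
    #    reflections) and ask for insights that cite their supporting records.
    reflections = []
    for q in questions:
        evidence = retrieve(q)
        numbered = "\n".join(f"{i + 1}. {m.description}" for i, m in enumerate(evidence))
        insights = llm(
            f"Statements:\n{numbered}\n"
            "What 5 high-level insights can you infer from the above statements? "
            "(example format: insight (because of 1, 5, 3))"
        )
        reflections.extend(insights.splitlines())
    return reflections  # these are written back into the memory stream
```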

  • Planning: Agents need to plan over a longer time horizon to ensure that their sequence of actions is coherent and believable. Otherwise, the agent will have lunch every 30 minutes from 12 to 2pm.

    • Plans describe a future sequence of actions for the agent, and help keep the agent’s behavior consistent over time. A plan includes a location, a starting time, and a duration.

    • The first step is to create a plan that outlines the day’s agenda in broad strokes. To create the initial plan, we prompt the language model with the agent’s summary description (e.g., name, traits, and summary of their recent experiences) and a summary of their previous day.

    • This generates a rough sketch of the agent’s plan for a day, divided into five to eight chunks: “1) wake up and complete the morning routine at 8:00 am, 2) go to Oak Hill College to take classes starting 10:00 am, [. . . ] 5) work on his new music composition from 1:00 pm to 5:00 pm, 6) have dinner at 5:30 pm, 7) finish school assignments and go to bed by 11:00 pm.”

    • The agent saves this plan in the memory stream and then recursively decomposes it to create finer-grained actions, first into hour-long chunks of actions, then each into 5-15 minute chunks.
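
A sketch of this recursive decomposition, again assuming a placeholder `llm()` helper; the prompt wording is illustrative, not the paper’s exact prompt:

```python
# Sketch of recursive plan decomposition: day agenda -> hour chunks -> 5-15 min.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def plan_day(agent_summary: str, yesterday_summary: str) -> list[str]:
    """Broad-strokes agenda for the day, five to eight entries."""
    text = llm(f"{agent_summary}\nYesterday: {yesterday_summary}\n"
               "Outline today's agenda in 5-8 numbered entries with times.")
    return text.splitlines()

def decompose(plan_item: str, granularity: str) -> list[str]:
    """Split one plan entry into finer-grained actions."""
    text = llm(f"Decompose this plan item into {granularity} actions, "
               f"each with a start time and duration:\n{plan_item}")
    return text.splitlines()

def full_plan(agent_summary: str, yesterday_summary: str) -> list[str]:
    fine_grained = []
    for item in plan_day(agent_summary, yesterday_summary):
        for hour_chunk in decompose(item, "hour-long"):
            fine_grained.extend(decompose(hour_chunk, "5-15 minute"))
    return fine_grained  # each entry is saved into the memory stream
```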

  • Reacting: Generative agents operate in an action loop where, at each time step, they perceive the world around them and those perceived observations are stored in their memory stream. We prompt the language model with these observations to decide whether the agent should continue with their existing plan, or react.

    • Example prompt: “[Agent’s Summary Description goes here] It is February 13, 2023, 4:56 pm. John Lin’s status: John is back home early from work. Observation: John saw Eddy taking a short walk around his workplace. Summary of relevant context from John’s memory: Eddy Lin is John Lin’s son. Eddy Lin has been working on a music composition for his class. Eddy Lin likes to walk around the garden when he is thinking about or listening to music. Should John react to the observation, and if so, what would be an appropriate reaction?”

    • Here, the context summary is generated through two prompts that retrieve memories via the queries “What is [observer]’s relationship with the [observed entity]?” and “[Observed entity] is [action status of the observed entity]”, and their answers are then summarized together.
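
A minimal sketch of the react decision, with the same placeholder `llm()` helper; the yes-prefix parse is purely illustrative:

```python
# Sketch of the react decision (llm() and the parse are placeholders).
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def should_react(agent_summary: str, now: str, status: str,
                 observation: str, context_summary: str) -> tuple[bool, str]:
    prompt = (
        f"{agent_summary}\n"
        f"It is {now}. Status: {status}.\n"
        f"Observation: {observation}\n"
        f"Summary of relevant context: {context_summary}\n"
        "Should the agent react to the observation, and if so, "
        "what would be an appropriate reaction?"
    )
    answer = llm(prompt)
    return answer.lower().startswith("yes"), answer  # naive parse for illustration
```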

  • Dialog: Dialog is generated by prompting the LLM with each agent’s summary description plus the most recent thing the other agent said, and asking the LLM to respond.
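
A sketch of one dialog turn under the same `llm()` placeholder:

```python
# Sketch of one dialogue turn: summary description + the other agent's last line.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def next_utterance(speaker_summary: str, last_utterance: str) -> str:
    return llm(f"{speaker_summary}\n"
               f'The other agent just said: "{last_utterance}"\n'
               "How does the agent respond?")
```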

The believability of the simulation is measured in an ablation study (meaning, with only certain components activated), where “human” denotes behavior authored by Mechanical Turk workers.