19: Terminology

Latent space:

  • A neural network can map, say, an input image into a lower-dimensional representation of itself. It can then also reconstruct the original input image from that lower-dimensional representation.

  • That lower-dimensional representation lives in latent space. We can think of it as a “compressed” version of the original input. (The network essentially stores a dictionary of features, and it has learned how to build up any image from those features.)

  • In order to optimize the training objective, the autoencoder may learn to place the encoded features of similar inputs (for example, cats) close to each other in the latent space, thus creating useful embedding vectors where similar inputs are close in the embedding (latent) space (see the sketch below).
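
  • As a rough illustration, here is a minimal autoencoder sketch (PyTorch assumed; the dimensions are made up) that maps an input into a latent vector and reconstructs it:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: maps the input into the lower-dimensional latent space.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        # Decoder: reconstructs the input from its latent representation.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, input_dim))

    def forward(self, x):
        z = self.encoder(x)        # the "compressed" latent vector
        return self.decoder(z), z  # reconstruction plus latent code

model = AutoEncoder()
x = torch.rand(16, 784)                    # e.g. a batch of flattened 28x28 images
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)    # reconstruction objective
```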

Embeddings:

  • An embedding is a representation of an input in latent space (see above). For example, a word can get mapped into a multi-dimensional vector, and words with similar meaning are close to each other in latent space - meaning, their vectors are close.

  • A variational autoencoder (VAE) does this for images (including in the Stable Diffusion pipeline), for example.
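
  • A toy sketch of the idea (the vectors below are made up): words with similar meaning end up with a higher cosine similarity than unrelated words.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

embeddings = {
    "cat":      np.array([0.90, 0.80, 0.10]),
    "kitten":   np.array([0.85, 0.75, 0.20]),
    "elephant": np.array([0.10, 0.20, 0.90]),
}

print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))    # high (~1.0)
print(cosine_similarity(embeddings["cat"], embeddings["elephant"]))  # low  (~0.3)
```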

Contrastive learning:

  • You can do this if you have a dataset without labels. In this case, you change one of the objects in the dataset slightly, and then you train the network to treat this modified image as similar to the original image, but very different from some other image in the dataset.

  • Example: you have a dataset with animal pictures, but no labels. Take an image of a cat, crop it slightly differently, and apply some noise. Then tell the network to update its weights such that this new cat image looks similar to the original cat image (is close in embedding space), while an elephant image looks very different (see the loss sketch below).
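
  • A minimal sketch of one common contrastive objective (an InfoNCE-style loss, PyTorch assumed; the notes above don’t name a specific loss): pull the augmented cat towards the original cat and push the elephant away.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    # Cosine similarities between the anchor and the positive / negative embeddings.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = (anchor * positive).sum(-1, keepdim=True)   # shape (1,)
    neg_sim = negatives @ anchor                          # shape (num_negatives,)
    logits = torch.cat([pos_sim, neg_sim]) / temperature
    # Cross-entropy with the positive pair as the "correct class": maximize
    # similarity to the positive, minimize similarity to the negatives.
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))

cat      = torch.randn(128)     # embedding of the original cat image
cat_aug  = torch.randn(128)     # embedding of the cropped/noised cat image
elephant = torch.randn(1, 128)  # embedding(s) of the negative example(s)
loss = contrastive_loss(cat_aug, cat, elephant)
```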

Lottery ticket hypothesis:

  • After you train a network, you can use just a small part of it and throw the rest away, because all of the network’s predictive power is really stored just in that smaller subnetwork.

  • In the original paper that introduced this, they were able to prune away roughly 90% of the weights in the network and get the same (even slightly better-generalizing) performance. You do that iteratively: train the network, remove the weights with the smallest magnitudes (the ones that matter least for the output), reset the remaining weights to their initial values, and retrain (see the sketch below).
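
  • A minimal sketch of the magnitude-pruning step (PyTorch assumed; the layer and the 90% figure are just illustrative): keep only the largest-magnitude weights and zero out the rest.

```python
import torch

weights = torch.randn(256, 256)     # one weight matrix of a trained network
prune_fraction = 0.9                # throw away ~90% of the weights

# Find the magnitude threshold below which weights get pruned.
k = int(prune_fraction * weights.numel())
threshold = weights.abs().flatten().kthvalue(k).values
mask = (weights.abs() > threshold).float()

pruned = weights * mask             # the surviving "winning ticket" weights
print(f"kept {int(mask.sum())} of {weights.numel()} weights")
```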

In-context learning:

  • This essentially refers to prompt-writing: you can give a large language model a few examples of what it should do, plus some context, and then ask it to do that. Also called few-shot learning.
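
  • A toy few-shot prompt (the examples are made up): the “learning” happens entirely inside the prompt, and no weights are updated.

```python
# A few in-context examples followed by the actual query.
prompt = """Translate English to French.

English: cheese
French: fromage

English: bread
French: pain

English: butter
French:"""
# Feed `prompt` to the LLM; it should complete with "beurre".
```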

The Waluigi Effect:

  • From this post: after you train an LLM to satisfy a desirable property P, it becomes easier to elicit the chatbot into satisfying the exact opposite of property P. The reason is that any prompt puts the LLM into a “superposition” of contexts. One of those contexts might be that the bot should hate croissants. But it is a very small step from that context into the anti-context of loving croissants, so that context gets superimposed as well.

Prompt tuning:

  • Writing a natural language prompt is a pretty mysterious task, and a hand-written prompt might randomly perform poorly. An alternative is prompt tuning: it learns a prompt represented by continuous parameters rather than discrete natural language.

  • Concretely, since each prompt first gets tokenized and then turned into an embedding before it’s fed into the LLM, prompt tuning replaces a natural language prompt with an embedding directly. It calculates the embedding for a particular prompt, and then it prepends learnable embeddings to obtain a new embedded sequence.

  • Prepending learnable embeddings to just the input sequence is called shallow prompt tuning. Alternatively, you can also do that at every inner transformer layer, which is called deep prompt tuning (see the sketch below).
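
  • A minimal sketch of shallow prompt tuning (PyTorch assumed; dimensions and names are illustrative): prepend learnable “soft prompt” embeddings to the embedded input before it goes into the frozen LLM.

```python
import torch
import torch.nn as nn

num_virtual_tokens = 20
embed_dim = 768

# Learnable "soft prompt" embeddings; these are the only trained parameters.
soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.02)

def prepend_soft_prompt(token_embeddings):
    # token_embeddings: (batch, seq_len, embed_dim), the embedded natural-language
    # input. Prepend the learnable embeddings to every example in the batch.
    batch = token_embeddings.shape[0]
    prompt = soft_prompt.unsqueeze(0).expand(batch, -1, -1)
    return torch.cat([prompt, token_embeddings], dim=1)

# Example: a batch of 4 sequences of 10 tokens becomes 4 sequences of 30
# embedding positions that get fed into the frozen transformer.
x = torch.randn(4, 10, embed_dim)
print(prepend_soft_prompt(x).shape)   # torch.Size([4, 30, 768])
```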

Prefix (prompt) tuning:

  • This is another version of prompt tuning. Here, the learnable vectors are added to all transformer layers, not just the input layer.

Chain-of-thought prompting:

  • Ask an LLM to explain itself when producing an answer, and its performance will magically go up. (Not so magical: an LLM doesn’t have any memory except for its own context window, so by asking it to explain itself, it is essentially writing a plan for itself that it then automatically follows, because its output is fed back into its input.)
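
  • A toy chain-of-thought prompt (the wording is just illustrative): instead of asking for the answer directly, ask the model to write out its reasoning first.

```python
prompt = """Q: A cafeteria had 23 apples. They used 20 for lunch and bought 6 more.
How many apples do they have?
A: Let's think step by step."""
# The model first writes out the intermediate steps (23 - 20 = 3, then 3 + 6 = 9),
# and those steps are part of its context when it produces the final answer.
```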

Self-consistency, also Output space ensembling:

  • Diverse reasoning paths are sampled from a given language model using CoT, and the most consistent answer is selected as the final answer.

  • Several chains of thought are generated for a single prompt, then you pick the one that works best as the final answer (or do majority vote or something).
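
  • A minimal sketch of the majority-vote version (the `generate` and `extract_answer` helpers are hypothetical): sample several chains of thought and return the most common final answer.

```python
from collections import Counter

def self_consistent_answer(prompt, generate, extract_answer, n_samples=10):
    # Sample several diverse reasoning paths for the same prompt...
    answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
    # ...and return the final answer that most of the paths agree on.
    return Counter(answers).most_common(1)[0][0]
```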

Perplexity:

  • A metric to evaluate the performance of a language model. Generally, two ways to evaluate a language model:

    • Extrinsic evaluation (measure how well the model performs when applied to some external task, like “summarize this text” - this is usually the best way to do it)

    • Intrinsic evaluation (find some way to directly measure the language model itself without applying it to any specific task - this will be the most “general” way of evaluating it).

  • Perplexity is an intrinsic evaluation method. The lower it is, the better the model. Perplexity effectively measures how “surprised” (perplexed) the model is when it sees held-out text: intuitively, being less surprised is better, because it means the model has “internalized” the patterns of the language more effectively.

  • Perplexity is defined as the inverse probability of the test set, normalized by the number of words in the test set.
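
  • A small numeric illustration of that definition (the per-word probabilities are made up): perplexity is the inverse probability of the test set, taken as a geometric mean over the N words.

```python
import math

# Hypothetical per-word probabilities the model assigned to a 4-word test set.
word_probs = [0.2, 0.1, 0.25, 0.05]
N = len(word_probs)

log_prob = sum(math.log(p) for p in word_probs)   # log P(w_1 ... w_N)
perplexity = math.exp(-log_prob / N)              # P(w_1 ... w_N)^(-1/N)
print(perplexity)   # ~8.0; a lower value means a less "surprised" model
```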

Context distillation:

  • You have an existing LLM that works, but only if you give it the right prompt. Context distillation means fine-tuning a new LLM that does the same as the original LLM, but entirely without the prompt. (I.e., the fine-tuning trains the prompt “into” the new LLM.) Advantage: you don’t need to take up valuable context window space writing the same prompt every time to get your task done.
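
  • A rough sketch of the training signal (PyTorch assumed; `teacher` and `student` are hypothetical causal LMs that return next-token logits): the student, which never sees the prompt, is trained to match the prompted teacher’s output distribution.

```python
import torch
import torch.nn.functional as F

def context_distillation_loss(teacher, student, prompt_ids, input_ids):
    with torch.no_grad():
        # The frozen teacher predicts with the full prompt prepended...
        teacher_logits = teacher(torch.cat([prompt_ids, input_ids], dim=1))
        # ...but we only keep the positions corresponding to the actual input.
        teacher_logits = teacher_logits[:, -input_ids.shape[1]:, :]
    # The student predicts the same positions without ever seeing the prompt.
    student_logits = student(input_ids)
    # Train the student to match the teacher's (prompted) distribution.
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
```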

Instruction-tuning:

  • Training (fine-tuning) an LLM to do well-specified, particular tasks well. Tasks are stuff like “translate this English sentence to French”, or “summarize this paragraph in one sentence”, or “identify the most expensive item mentioned in this text”. Instruction-tuning can be done through fine-tuning (giving the model lots of task-input-output examples) or reinforcement learning from human feedback (asking the model to produce several output examples and letting a human vote on which one is best, then using that for training).

  • It turns out that instruction-tuning is incredibly good at making an LLM produce what humans would consider “useful” output - for example, ChatGPT is instruction-tuned to act as a chatbot, and without that, you’d have to awkwardly compose prompts that only make use of the model’s “original” ability to complete sentences. Instruction-tuning essentially “discovers” model capabilities that were already hidden in the LLM’s trained capabilities, but weren’t coming to the forefront when prompting it.
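
  • A toy example of what instruction-tuning data tends to look like (the fields and wording are illustrative, not from any specific dataset): the model is fine-tuned to map instruction + input to the desired output.

```python
# Each record is one task-input-output example used for fine-tuning.
examples = [
    {
        "instruction": "Translate this English sentence to French.",
        "input": "The cat sleeps on the sofa.",
        "output": "Le chat dort sur le canapé.",
    },
    {
        "instruction": "Summarize this paragraph in one sentence.",
        "input": "Perplexity is an intrinsic metric for language models...",
        "output": "Perplexity measures how surprised a model is by text.",
    },
]
```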