09: Fine-tuning

RAG vs. Fine-tuning: Pipelines, Trade-offs, And A Case Study on Agriculture

(Jan 2024, link)

The paper compares the performance of LLMs in answering agriculture questions for a base LLM, a fine-tuned LLM, and a retrieval-augmented generation LLM. An example question: “What is the best times to plant trees and shrubs in [Arkansas, Connecticut, Georgia]?”. Here is the pipeline the paper set up: start with an agriculture dataset, then generate Q&A data, then use that to a) fine-tune an LLM and b) set up RAG, then compare all of those.

Step 1 is the information extraction from the agri dataset. The main problem here is information extraction from PDFs. Their goal is to not only to recover the content of each file, but also its structure - for example: sections and subsections, parsing the information presented in tables and diagrams, identifying cross-references within the document, and linking images with their caption and description. Various tools are available for this, but they are all deficient:

pdf2text: Able to recover the textual information, but markers representing the beginning of a section or subsection are lost within the retrieved data, hindering our ability to reason over the document structure. Captions of tables and figures are also lost in conversion but sometimes contain critical information for the understanding of the document.
pyPDF: Seems to have other limitations.

They decide to use GROBID, a machine learning library specifically tailored for extracting and processing data from scientific literature in PDF format. The use of GROBID, trained on a vast corpus of scientific articles, enables the recognition of a wide array of document elements and extraction of associated bibliographic data: it extracts a JSON description of the document’s content and structure.

They then use an LLM to generate 5-15 questions per document section.

Final setups:

They set up the following RAG pipeline:

Calculating embeddings for text chunks using sentence transformers
Retrieve relevant text chunks using FAISS with similarity_search_with_score
Use GPT-4 plus text snippets to generate the answer.

They set up the following fine-tuning pipeline:

They fine-tune Llama 2 and GPT-4

Here are the results. For RAG, retrieving more text chunks improves results:

What’s also interesting is that the more content (base documents) you jam into the knowledge base, the worse recall gets, but it’s a relatively gradual decline:

This is the most important table, comparing fine-tuning vs. RAG. Really interesting, fine-tuning Llama2 doesn’t do anything for accuracy, only RAG improves it:

Correctness is here. RAG makes a big difference here, much more than fine-tuning.

The following is a great table too: can GPT-4 learn new knowledge through fine-tuning? The answer is yes.

Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Jun 2023, added 6/19/23)

09: Fine-tuning

Contents

RAG vs. Fine-tuning: Pipelines, Trade-offs, And A Case Study on Agriculture

Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Jun 2023, added 6/19/23)

Fine-tuning LLMs

LIMA: Less Is More for Alignment (May 2023, added 5/23/23)

Exploring the Role of Explanations in Finetuning and Prompting for Reasoning Skills of Large Language Models (May 2023, added 5/23/23)

Multitask Prompted Training Enables Zero-shot Task Generalization (Mar 2022)

Exploring the Benefits of Training Expert Language Models over Instruction Tuning (Feb 2023)

SUPER-NATURALINSTRUCTIONS: Generalization via Declarative Instructions on 1600+ NLP Tasks (Oct 2022)

Scaling Instruction-Finetuned Language Models (Dec 2022)

One Embedder, Any Task: Instruction-Finetuned Text Embeddings (Dec 2022)

OPT-IML : Scaling Language Model Instruction Meta Learning through the Lens of Generalization (Jan 2023)

WizardLM: Empowering Large Language Models to Follow Complex Instructions (Apr 2023, added 4/29/23)

QLORA: Efficient Finetuning of Quantized LLMs (May 2023, added 5/25/23)

Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning (Mar 2023)

Parameter-efficient fine-tuning of large-scale pre-trained language models (Mar 2023)

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention (Mar 2023)

Efficient Large Language Model training with LoRA and Hugging Face (Mar 2023)

Clinical Camel: An Open-Source Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding

Empower Large Language Model to Perform Better on Industrial Domain-Specific Question Answering (May 2023, added 5/23/23)

Hugging Face: Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU (Mar 2023)

Accelerating LLaMA with Fabric: A Comprehensive Guide to Training and Fine-Tuning LLaMA (Apr 2023)

StackLlama (Apr 2023)

Reward Design With Language Models (Mar 2023)

Could you train a ChatGPT-beating model for $85,000 and run it in a browser? (Mar 2023)

Which GPU To Get For Deep Learning (Jan 2023)

High-throughput Generative Inference of Large Language Models with a Single GPU (Mar 2023)

Vicuna-13B: Open-source Chatbot on Llama (Apr 2023)

Building a conversational AI agent by finetuning an LLM (Jan 2023)

BloombergGPT (Mar 2023)

The Economics of Large Language Models (Jan 2023)

Specializing Smaller Language Models towards Multi-Step Reasoning (Mar 2023)

PMC-LLaMA: Further Finetuning LLaMA on Medical Papers (Apr 2023, added 5/7/23)

Summary Of Models Available For Fine-Tuning & In-house Execution