EHRAgent: Code Empowers Large Language Models for Complex Tabular Reasoning on Electronic Health Records

(Jan 2024, link)

The paper describes a system that: 1) lets clinicians ask questions of EHR data in natural language, and 2) has an LLM agent that generates and executes code to interact with that EHR data. So the trick is that the LLM never sees the EHR data directly; instead, it writes code that interacts with the EHR data. The chart below shows an example: the EHRAgent receives a prompt (shown in the image) that gives it access to specified API functions and few-shot examples for writing code against the EHR data. The agent then plans the task, writes the code, and executes it. Finally, it refines the code: if execution produces an error, the error is sent back to the LLM, which rewrites the code.
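To make that loop concrete, here is a minimal sketch (my own, not the paper's code) of the generate-execute-refine cycle; `call_llm`, `api_docs`, and `examples` are hypothetical stand-ins for the real LLM client and prompt material:

```python
# Minimal sketch of an EHRAgent-style loop (not the paper's implementation).
import traceback

def call_llm(prompt: str) -> str:
    """Stub: replace with a real LLM API call."""
    raise NotImplementedError

def run_agent(question: str, api_docs: str, examples: str, max_rounds: int = 3) -> dict:
    prompt = (
        f"{api_docs}\n\n{examples}\n\n"
        f"Question: {question}\n"
        "Write Python code that answers the question using only the APIs above."
    )
    namespace: dict = {}                      # execution environment (simplified)
    code = ""
    for _ in range(max_rounds):
        code = call_llm(prompt)
        try:
            exec(code, namespace)             # in practice: a restricted sandbox
            return {"code": code, "answer": namespace.get("answer")}
        except Exception:
            err = traceback.format_exc()
            # Refinement step: feed the error back and ask for a fixed version.
            prompt += f"\n\nYour code failed with:\n{err}\nPlease fix the code."
    return {"code": code, "answer": None}
```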

There are some interesting ideas in this setup:

  • Long-term memory: When a generated code snippet executes successfully, the agent stores the snippet together with a description of the question in a long-term memory. When it encounters a new question, it compares that question to all the questions stored in long-term memory and picks the most similar examples to use as part of its few-shot prompt (a sketch of this retrieval follows after this list).

  • Interactive coding: This is nothing new, but it works really well here: after generating code, it runs it and sends any error messages back to the LLM for code refinement.

  • Medical knowledge integration: This is a bit of a misnomer; it just means a detailed description of how data is stored in the EHR. The paper initializes this by sending table schemas and metadata to an LLM, asking it to generate natural-language descriptions of the EHR data architecture, and then storing those descriptions.
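A hedged sketch of the long-term memory idea: store solved (question, code) pairs and retrieve the most similar past questions as few-shot examples. The similarity function here is a toy bag-of-words cosine; the paper's actual retrieval mechanism may differ.

```python
from collections import Counter
from math import sqrt

memory: list[tuple[str, str]] = []   # (question, successful code snippet)

def remember(question: str, code: str) -> None:
    """Store a snippet that executed successfully, keyed by its question."""
    memory.append((question, code))

def _cosine(a: str, b: str) -> float:
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def best_examples(new_question: str, k: int = 4) -> list[tuple[str, str]]:
    """Return the k stored examples whose questions are most similar."""
    ranked = sorted(memory, key=lambda qc: _cosine(new_question, qc[0]), reverse=True)
    return ranked[:k]
```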

The table below shows what happens to accuracy when removing certain components of the setup. The biggest impact comes from removing the initial medical knowledge, and from removing interactive coding.

Good data in the chart below: once you’ve given 4 examples, giving more doesn’t make any difference anymore.


Health system-scale language models are all-purpose prediction engines (Jun 2023, added 6/16/23)

Cool paper: the NYU health system in New York used the clinical notes in their EHR to train a language model, and then used that for five predictive tasks (such as, will this patient get readmitted in 30 days?). It turns out that by just doing loss-minimizing language training on clinical notes, you get an all-purpose clinical prediction engine. Overall:

  • Model can do 5 tasks: 30-day all-cause readmission prediction, in-hospital mortality prediction, comorbidity index prediction, length of stay prediction, and insurance denial prediction

  • NYUTron has an area under the curve (AUC) of 78.7–94.9%, with an improvement of 5.36–14.7% in the AUC compared with traditional models

Here is how the training went:

  1. They collected a set of unlabeled clinical notes and five task-specific labeled clinical notes from the NYU Langone EHR. The large unlabelled dataset comprises 7.25 million clinical notes (for example, radiographic reads, history and physicals) from 387,144 patients across four hospitals, resulting in a 4.1 billion-word corpus curated from January 2011 to May 2020. Each one of the labeled fine-tuning sets contains 1-10 years of inpatient clinical notes (55,791–413,845 patients, 51–87 million words) with task-specific labels (2–4 classes).

  2. They pretrained a bidirectional encoder model (BERT) with a masked language modeling (MLM) objective on the NYU Notes dataset until the validation loss plateaued (MLM: randomly mask words or subwords in clinical notes, then train the language model to fill in the masked word correctly).

  3. Using the fine-tuning dataset, they then fine-tuned the pretrained model to predict the task label using the relationships learned in pretraining with clinical notes.
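A minimal sketch of this two-stage recipe (my reconstruction, not the paper's code), using the Hugging Face Transformers API; the model name, output paths, and dataset arguments are placeholders supplied by the caller:

```python
from transformers import (AutoModelForMaskedLM, AutoModelForSequenceClassification,
                          AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

def pretrain_then_finetune(unlabeled_notes, labeled_notes, num_labels: int = 2):
    """unlabeled_notes / labeled_notes: tokenized datasets supplied by the caller."""
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Stage 1: masked language modeling on the unlabeled clinical notes.
    mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
    collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=True, mlm_probability=0.15)
    mlm_trainer = Trainer(
        model=mlm_model,
        args=TrainingArguments(output_dir="notes-lm"),
        train_dataset=unlabeled_notes,
        data_collator=collator,
    )
    mlm_trainer.train()
    mlm_trainer.save_model("notes-lm")

    # Stage 2: fine-tune the pretrained encoder as a classifier,
    # e.g. 2 classes for 30-day readmission (yes/no).
    clf = AutoModelForSequenceClassification.from_pretrained("notes-lm", num_labels=num_labels)
    clf_trainer = Trainer(
        model=clf,
        args=TrainingArguments(output_dir="readmission-clf"),
        train_dataset=labeled_notes,
    )
    clf_trainer.train()
    return clf_trainer
```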

Results: left, readmission probability and in-hospital mortality prediction. Right, probability of insurance claim denial, and length-of-stay prediction. In each case, their language model improves on a classical machine learning model (using various features).

Other neat insights:

  • NYUTron was competitive with a small group of physicians at predicting 30-day readmission. For physicians and NYUTron, the median false positive rate (FPR) was 11.11%, whereas the median true positive rate (TPR) was 50% for physicians compared with 81.82% for NYUTron.

  • How do you compare the model to other language models? NYUTron had the highest area under the curve (AUC) when fine-tuned with the full dataset, with a median AUC of 80%. That is similar to an LLM trained on clinical+web-wiki+bio. Compared with LLMs pretrained with non-clinical text (web-wiki+bio and web-wiki), NYUTron’s median AUC was 2.37% to 3.23% higher. Compared with the traditional model that uses structured features (lace+xgb), NYUTron had a 5.36% higher AUC. Compared with a model using traditional natural language processing (NLP) embedding (tf-idf+xgb), NYUTron had a 12.8% higher median AUC.

    • Interesting: you don’t really need proprietary notes for pretraining; pretraining on clinical text from the web gets you similar outcomes! (Though you still need the health system’s labeled data for fine-tuning.)

  • An LLM trained on unstructured clinical notes better scales with data than traditional structured models. NYUTron’s AUC consistently improved with more examples whereas lace+xgb’s AUC started to plateau (from 100 to 1,000 examples, NYUTron’s AUC increased by 7.27% whereas that of lace+xgb increased by 3.98%; from 10,000 to 392,336 examples, NYUTron’s AUC increased by 2.15% whereas that of lace+xgb increased by 0.63%).

  • Pretraining on a large amount of unlabeled clinical notes contributes to performance. Compared with the randomly initialized LLM (random-init), NYUTron learns to generalize better from fewer examples: whereas NYUTron needed 10,000 examples to achieve an AUC of around 75%, random-init needed 100,000 examples.

  • It is beneficial to match the domain of the pretraining corpus and the domain of the fine-tuning corpus. LLMs pretrained on non-clinical text (web-wiki and web-wiki+bio) performed similarly to random-init, while a separate LLM pretrained on web-wiki+bio+clinical performed similarly to NYUTron. Also, compared with LLMs pretrained on non-clinical text, the clinically pretrained LLMs (NYUTron and web-wiki+bio+clinical) learned to generalize better from fewer examples.


Towards Expert-Level Medical Question Answering with Large Language Models (May 2023, added 5/18/23)

(link)

This is the Google technical paper on Med-PaLM 2, its latest clinical-focused LLM. First, on the model itself:

  • Built on PaLM 2, Google’s latest LLM.

  • Then applied fine-tuning to the base model: the datasets used included the training splits of MultiMedQA, namely MedQA, MedMCQA, HealthSearchQA, LiveQA and MedicationQA. They mix those datasets in a specific way, and the mixture was determined empirically through testing. (Which is interesting, that the precise mixture makes a difference for the quality of the outcomes on the test benchmarks)

Now, in terms of model performance:

  • Overall certainly impressive: below is the benchmark that comes closest to what the standard US medical exam asks about

  • Even more impressively, both physicians and lay people really can’t tell the difference anymore, and mostly prefer Med-PaLM 2-generated answers to other physicians’ answers. The model is particularly strong on “omits more information”, where it beats physicians; not surprisingly, an LLM is just a much better lookup tool for the latest medical knowledge. But even on extent and likelihood of harm, the model scores much better than physicians.

  • But here is also something interesting: despite all its fine-tuning on medical data, the model is not much better than general-purpose GPT-4!

  • By using ensemble refinement, you get another 2-5 points in performance. It’s a clever prompting strategy: the model first samples several reasoning chains and answers, and is then conditioned on all of them to produce a refined final answer (see the sketch after this list).

  • Here is model vs. physician on various quality aspects, in this case for the benchmark on long-form answers. Really the only place where physicians still beat the model is in “no inaccurate/irrelevant information” - LLMs are just more chatty than physicians.

  • By lay-person raters, Med-PaLM 2 answers were rated as more directly relevant and helpful than Med-PaLM answers on the MultiMedQA 140 dataset. (the more solid the color, the more positive the answer)

  • Finally, a good general insight: here is the inter-rater reliability for various quality metrics, i.e., how often the human raters agreed on how to evaluate a particular answer. What this shows is that raters most often disagree about whether information is unnecessary or whether something important is missing. Not surprising, but good to know when evaluating model performance.
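A hedged sketch of the ensemble-refinement idea mentioned above: sample several reasoning chains, then condition the model on all of them to produce refined answers, and take a vote. `sample_llm` and the sample counts are placeholders, not the paper's settings.

```python
from collections import Counter

def sample_llm(prompt: str, temperature: float = 0.7) -> str:
    """Stub: replace with a real, temperature>0 LLM call."""
    raise NotImplementedError

def ensemble_refinement(question: str, n_samples: int = 8, n_refine: int = 8) -> str:
    # Stage 1: stochastically sample several candidate answers with explanations.
    drafts = [sample_llm(question) for _ in range(n_samples)]
    context = question + "\n\nCandidate answers:\n" + "\n---\n".join(drafts)
    # Stage 2: generate refined answers conditioned on all drafts, then vote.
    refined = [sample_llm(context + "\n\nGive a single refined final answer.")
               for _ in range(n_refine)]
    return Counter(refined).most_common(1)[0][0]
```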


Capabilities of GPT-4 on Medical Challenge Problems (Mar 2023)

(link)

Runs GPT-4 on various medical exams. It performs much better than previous models.

Here is a weird thing: the medical test contains a bunch of images, and the paper does not pass those to GPT-4. Despite just reading text, the model still performs well.


Microsoft’s Peter Lee NEJM Interview (March 2023)

  • Sebastien Bubeck at Microsoft Research has a paper where they train a neural net to solve systems of linear equations. They do that by giving it large corpora of training data that contain systems of linear equations and their solutions.

  • Once trained, that neural net is able to solve these systems of linear equations. But if the training corpus only contained systems with up to three terms, and you then ask the trained net to solve a system with four terms, it fails.

  • Start again with the same training corpus, but add to it a big pile of text, say non-mathematical text from Wikipedia, and go through the same training. The resulting neural net is able to solve systems of linear equations regardless of the number of terms.

  • There is some knowledge and some circuitry being distilled from the structure of language in ways that we do not fully understand yet, and that is leading at least to some form of emergent capability in these large language models.


Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine (Mar 2023)

(NEJM)

  • Using GPT-4 as a clinical chatbot: the biggest concern is hallucination. For example, the model hallucinated a patient’s body-mass index even though it was never measured, and made up its own clinical credentials.

  • One solution is to have GPT-4 check its own output: asking it to evaluate a chat history lets it catch its own hallucinations.
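A minimal sketch of that self-verification idea: a second pass over the chat transcript asks the model to flag unsupported claims. `call_llm` and the prompt wording are placeholders, not the article's setup.

```python
def call_llm(prompt: str) -> str:
    """Stub: replace with a real GPT-4 API call."""
    raise NotImplementedError

def check_transcript(transcript: str) -> str:
    prompt = (
        "You are reviewing the following clinical chat transcript. "
        "List any claims that are not supported by information the patient "
        "or clinician actually provided (possible hallucinations):\n\n" + transcript
    )
    return call_llm(prompt)
```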


Artificial Intelligence and Machine Learning in Clinical Medicine (Mar 2023)

(NEJM)

  • Nothing particularly interesting in here, just a general treatment


ChatDoctor: A Medical Chat Model Fine-tuned on LLaMA Model using Medical Domain Knowledge (Mar 2023)



Mathematical discoveries from program search with large language models

(Aug 2023, link)

This is a super cool Google paper in Nature which shows how you can use an LLM to evolve better code to solve a particular problem. The approach is simple: the input to FunSearch is a specification of the problem in the form of an ‘evaluate’ function, an initial implementation of the function to evolve (which can be trivial), and potentially a skeleton. At each iteration, FunSearch builds a prompt by combining several programs sampled from the programs database (favoring high-scoring ones). The prompt is then fed to the pretrained LLM and new programs are created. Newly created programs are then scored and stored in the programs database (if correct), thus closing the loop.
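A hedged sketch of that outer loop (my own, not DeepMind's implementation): `llm_propose` stands in for the pretrained code LLM, and `evaluate` is the problem-specific scoring function supplied as part of the problem specification.

```python
import random

def llm_propose(prompt: str) -> str:
    """Stub: ask the code LLM for a new/improved program given the prompt."""
    raise NotImplementedError

def funsearch(evaluate, initial_program: str, iterations: int = 1000):
    database = [(evaluate(initial_program), initial_program)]   # (score, code) pairs
    for _ in range(iterations):
        # Sample a couple of programs for the prompt, biased toward high scores
        # (rank-based weights; sampling is with replacement in this simplification).
        database.sort(key=lambda sc: sc[0])
        weights = range(1, len(database) + 1)
        sampled = random.choices(database, weights=weights, k=min(2, len(database)))
        prompt = "Improve on these programs:\n\n" + "\n\n".join(p for _, p in sampled)
        candidate = llm_propose(prompt)
        try:
            score = evaluate(candidate)        # discard programs that fail to run/score
        except Exception:
            continue
        database.append((score, candidate))
    return max(database, key=lambda sc: sc[0]) # best (score, program) found
```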

Interesting observations:

  • They choose a fast LLM over the best one (Codey, a code-generation LLM built on PaLM 2). The results in the paper are obtained using a total number of samples on the order of 10^6, which would take far too long with a slower model.

  • They have empirically observed that the results obtained in this paper are not too sensitive to the exact choice of LLM, as long as it has been trained on a large enough corpus of code.

Only evolve the core part of a particular program:

Performance tends to improve significantly if we write the initial ‘solve’ program in the form of a skeleton (containing boilerplate code and previous knowledge of the problem in the form of a program structure), and only use FunSearch to evolve the critical part that governs its logic. The chart below shows an example in which the skeleton takes the form of a simple greedy algorithm, and FunSearch only evolves the priority() function that is used to make the greedy decision at every step. A fixed skeleton constrains the space of programs that can be discovered, but it improves overall results because it focuses the LLM resources on only evolving the critical part, instead of also using the LLM to recreate already known program structures (with more opportunities for mistakes that would render the entire program incorrect). More precisely: evaluate() is the fixed function that checks whether a solution works; without a skeleton, solve() itself is the function evolved by FunSearch, whereas with a skeleton only priority() is evolved.
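A hedged illustration of the skeleton idea, using online bin packing in the spirit of the paper's example (this is my paraphrase, not the paper's exact code). Only priority() would be evolved by FunSearch; the greedy skeleton and the evaluator stay fixed.

```python
import numpy as np

def priority(item: float, bins: np.ndarray) -> np.ndarray:
    """Evolved part: score each open bin for the incoming item (trivial baseline here)."""
    return -(bins - item)            # prefer the tightest fit (best-fit heuristic)

def solve(items: list[float], capacity: float) -> list[int]:
    """Fixed skeleton: greedily place each item into the highest-priority bin."""
    bins = np.array([])              # remaining capacity of each open bin
    assignment = []
    for item in items:
        valid = bins >= item
        if valid.any():
            scores = np.where(valid, priority(item, bins), -np.inf)
            j = int(np.argmax(scores))
        else:
            bins = np.append(bins, capacity)   # open a new bin
            j = len(bins) - 1
        bins[j] -= item
        assignment.append(j)
    return assignment

def evaluate(items: list[float], capacity: float) -> int:
    """Fixed scorer: fewer bins used is better (negated so higher = better)."""
    return -len(set(solve(items, capacity)))
```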

Use the islands model to force genetic diversity of programs:

When prompting the LLM, they feed several existing programs into the prompt and ask the LLM to evolve the code, so they keep a population of working programs in a database. To force diversity, they sort those programs into islands that evolve independently from each other (a sketch of this database follows after the bullets below).

  • To sample from the program database, they first sample an island and then a program within that island, favoring higher-scoring and shorter programs.

  • They let information flow between the islands by periodically discarding the programs in the worst half of the islands (the ones whose best individuals have the lowest scores). The programs in those islands are replaced with a new population, initialized by cloning one of the best individuals from the surviving islands.

  • In practice, they feed k = 2 programs into each prompt, as two functions lead to better results compared to just one, with diminishing returns beyond that. (That’s a surprisingly small number.)
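A hedged sketch of such an islands database, simplified from the paper's description (the paper also favors shorter programs, which is omitted here):

```python
import random

class IslandDatabase:
    def __init__(self, n_islands: int = 10):
        self.islands: list[list[tuple[float, str]]] = [[] for _ in range(n_islands)]

    def add(self, island_idx: int, score: float, program: str) -> None:
        self.islands[island_idx].append((score, program))

    def sample_prompt_programs(self, k: int = 2) -> tuple[int, list[str]]:
        """Pick an island, then k programs from it, favoring higher scores."""
        idx = random.randrange(len(self.islands))
        island = self.islands[idx]
        if not island:
            return idx, []
        island.sort(key=lambda sp: sp[0])                     # ascending score
        weights = range(1, len(island) + 1)                   # rank-based bias
        picks = random.choices(island, weights=weights, k=min(k, len(island)))
        return idx, [p for _, p in picks]

    def reset_worst_islands(self) -> None:
        """Discard the worst half of the islands, reseeding them from survivors."""
        best = lambda isl: max(isl)[0] if isl else float("-inf")
        order = sorted(range(len(self.islands)), key=lambda i: best(self.islands[i]))
        half = len(self.islands) // 2
        worst, survivors = order[:half], order[half:]
        for i in worst:
            donor = random.choice(survivors)
            # Clone one of the donor island's best programs as the new seed.
            self.islands[i] = [max(self.islands[donor])] if self.islands[donor] else []
```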

The paper shows that the LLM finds a smarter algorithm to solve the cap set problem (finding the largest possible set of vectors such that no three vectors sum to zero).


MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning

(Nov 2023, link)

The paper attempts to answer medical questions in various benchmarks. Its idea is to prompt the same LLM by telling it to act as different experts in different clinical domains, and then summarize all experts’ reports. This is the workflow: (1) assemble experts based on the nature of the medical question, (2) ask each expert to study the question, (3) summarize all expert reports, (4) have the experts discuss all reports, (5) make a final decision.

The prompting is extremely simple. Here are the two prompts used to find experts: “You are a medical expert who specializes in categorizing a specific medical scenario into specific areas of medicine.” and “As a medical expert, you possess the ability to discern the two most relevant fields of expertise needed to address a multiple-choice question encapsulating a specific medical context.”

Here is the prompt that asks the LLM to act as an expert: “You are a medical expert in the domain of x. From your domain, your goal is to scrutinize and diagnose the symptoms presented by patients in specific medical scenarios.”
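A hedged sketch condensing the five-step workflow into code; `call_llm` is a stand-in for the underlying chat model, and the prompt wording is paraphrased from the prompts quoted above rather than copied from the paper.

```python
def call_llm(prompt: str) -> str:
    """Stub: replace with a real LLM call."""
    raise NotImplementedError

def medagents_answer(question: str, n_experts: int = 2, rounds: int = 1) -> str:
    # (1) Assemble experts based on the nature of the question.
    domains = call_llm(
        "You are a medical expert who categorizes a medical scenario into areas "
        f"of medicine. List the {n_experts} most relevant fields for:\n{question}"
    ).splitlines()[:n_experts]
    # (2) Ask each expert to study the question.
    reports = [call_llm(f"You are a medical expert in the domain of {d}. "
                        f"Analyze this question:\n{question}") for d in domains]
    # (3) Summarize all expert reports.
    summary = call_llm("Summarize these expert reports:\n" + "\n\n".join(reports))
    # (4) Have the experts discuss and revise the summary.
    for _ in range(rounds):
        summary = call_llm("Experts, review and revise this summary if needed:\n" + summary)
    # (5) Make a final decision.
    return call_llm(f"Question:\n{question}\n\nConsensus report:\n{summary}\n"
                    "Give the final answer choice.")
```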

That’s it. Below are the results. This framework is better than zero-shot. But it is not any better than few-shot with chain-of-thought and self-consistency. So it is kind of pointless.


Diversifying AI: Towards Creative Chess with AlphaZero

Aug 2023, here

The basic idea here:

  1. Reinforcement learning is great at optimizing towards ML systems that solve a particular problem setup well. For example, Google’s AlphaZero learns how to play chess extremely well.

  2. But AlphaZero then still turns out to be bad at solving chess problems, like Penrose positions (a collection of chess puzzles).

  3. So the paper introduces a new idea: if it seems that AlphaZero gets “stuck” on a particular way of playing chess, why not train several versions of AlphaZero that all play chess differently, and then mix them?

So the paper uses this approach:

  • Train several AlphaZero versions that each start with a particular Penrose challenge (a difficult chess starting position).

  • Have all versions play in parallel.

  • Use some expert curation algorithm to pick the best model’s next move, for each move.
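A very rough sketch of the per-move selection idea. The notes only say “some expert curation algorithm”; purely for illustration, I assume each agent exposes a value estimate for its preferred move and the selector takes the move with the highest estimate. The paper's actual mechanism may differ.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    # Returns (move, estimated value of the position after that move) for a position.
    propose: Callable[[str], tuple[str, float]]

def select_move(position: str, agents: list[Agent]) -> tuple[str, str]:
    """Ask every agent for its move and pick the one with the highest value estimate."""
    proposals = [(agent.name, *agent.propose(position)) for agent in agents]
    best = max(proposals, key=lambda p: p[2])
    return best[0], best[1]          # (which agent was chosen, its move)
```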

This works really well: a) the AlphaZero versions trained on the Penrose set become really good at solving those particular positions but don’t generalize to other positions, and b) if you pick the best model for each position, you get much better overall performance.