Building a Superagent: How we’re using AI to explain healthcare, faster

By Pál Takácsi, Sauhard Sahi, and Rushi Shah

Getting answers to basic health insurance questions can be harder than it should be.

Members reach out with questions about whether a procedure is covered, how much it might cost, or whether they need prior authorization. The answers depend on many factors: networks, benefits, reason for visit, diagnoses. Often several different scenarios are possible.

For Oscar’s Care Guides, answering these questions can be time-consuming, both because of the many pathways a care journey can take and because they lack tools to diagnose and respond to complex questions. These gaps persist despite the transparency of Oscar’s end-to-end in-house tech stack and its real-time claims adjudication and simulation.

This led us to ask: How can we give our Care Guides an AI assistant to help answer these questions instantly?

Architecture

Our Superagent is an orchestration of language models and internal APIs. We layer intent classification, targeted information retrieval, calls to internal endpoints, and answer synthesis so that the system can produce responses with supporting citations even when the source data comes from different systems.

The initial prototype launched with a select group of ten Care Guides. It focused on assessing the tool's utility in real-world scenarios by providing a co-pilot to Care Guides to help them answer member questions about their insurance benefits. We saw strong results: an overall satisfaction rate of 82.6%, measured through direct Care Guide feedback.

Benchmarking Performance

While the initial pilot provided positive signals regarding usefulness, the complexity and seriousness of healthcare require a robust method to measure the accuracy of Superagent's responses. We implemented a comprehensive quality evaluation program. This involved:

  • Generating responses to historical questions: We leverage a vast dataset of past inquiries to create a comprehensive test set for Superagent.

  • Developing a detailed audit framework: This framework allows us to systematically measure the groundedness, relevance, clarity, and completeness of the answers provided. (See below for definitions of these metrics.)

  • Employing a hybrid auditing approach: We utilize large language models as judges to evaluate the relevance, clarity, and completeness of answers, while human auditors verify groundedness to ensure factual accuracy and proper citation.

We defined the following metrics to measure the efficacy of Superagent answers:

  • Groundedness: This metric assesses the factual accuracy of the output. An answer is considered “grounded” if it’s completely correct and factual, based exclusively on the relevant documents and resources utilized by the system.

  • Completeness: This gauges whether the output provides a full and comprehensive response to the user's inquiry, leaving no part of the question unanswered.

  • Relevance: The output is deemed relevant if it contains information that is pertinent or directly applicable to the user's needs, concerns, or questions, without extraneous details.

  • Clarity and Tone: This evaluates the readability and user-friendliness of the output. The response must be clear, concise, member-friendly, and free from jargon or references to any internal tools or services.
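Once each answer carries a pass/fail label per metric (from the LLM judges and human auditors), batch-level scores are just the pass rate per metric. A minimal sketch of that roll-up, with invented audit data:

```python
# Aggregate per-answer audit labels into batch-level metric scores.
# The labels come from LLM judges (relevance, clarity, completeness) and
# human auditors (groundedness); this only shows the aggregation step.

METRICS = ("groundedness", "completeness", "relevance", "clarity_tone")

def batch_scores(audits: list[dict]) -> dict:
    # Each audit maps metric name -> bool (did the answer pass?).
    return {m: sum(a[m] for a in audits) / len(audits) for m in METRICS}

audits = [
    {"groundedness": True,  "completeness": True, "relevance": True, "clarity_tone": True},
    {"groundedness": False, "completeness": True, "relevance": True, "clarity_tone": True},
]
print(batch_scores(audits))  # groundedness 0.5, everything else 1.0
```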

Measuring Effectiveness

To set a practical standard, we established a Care Guide (CG) Benchmark, representing the performance of the internal human Care Guides who traditionally answer these questions, measured across a set of ~400 historical questions.

The "CG Benchmark" in the table below highlights a significant observation. Looking solely at the limited number of samples in our benchmark tests, human Care Guides achieved 90% Groundedness, 82% Relevance, and 99% Clarity/Tone. However, their Completeness stands at 62%. 

This lower completeness likely reflects the complexities of healthcare. Because Care Guides do not have the tools and training to dynamically iterate through different scenarios and statistically price out different care pathways, they often have to ask questions members may not know how to answer, such as “what CPT code will likely be issued by your doctor?”

We count these answers as “incomplete.” Based on our member satisfaction scores, one of the things members like best about Oscar is our ability to route them to the right care and providers. But we know that AI can help us increase their satisfaction further. Superagent is designed to do unlimited work for the member. It tirelessly processes vast amounts of data, cross-references sources, simulates different probable pathways, and synthesizes complete answers without human limitations, aiming for comprehensive, one-and-done resolutions.

Our Journey to Launch: Performance Batches

The table below shows Superagent's performance across development batches as we moved toward our launch thresholds. Each cycle followed the same loop: deploy a new version live, review all answers, make fixes, and start again. These cycles needed to be fast and iterative, roughly two weeks each. Each batch represented a significant set of questions used for evaluation.

Through extensive audits, we identified and addressed a long tail of issues, and the results from these rigorous testing phases were instrumental in guiding Superagent's development. With each cycle, our scores consistently improved.

| Batch | No. of Questions | Groundedness | Completeness | Relevance | Clarity/Tone |
|---|---|---|---|---|---|
| Care Guide Accuracy Benchmark | 422 | 90% | 62% | 82% | 99% |
| Success Threshold | - | 90% | 90% | 90% | 99% |
| 1 | 451 | 84% | 92% | 99% | 100% |
| 2 | 432 | 89% | 90% | 99% | 100% |
| 3 | 449 | 93% | 93% | 99% | 100% |
| 4 | 451 | 96% | 92% | 99% | 99% |

How Superagent Works

The agent architecture, visualized in the accompanying diagram, operates in several stages. It leverages the LLM to extract information from the member question and uses both pre-indexed knowledge (RAG) and real-time Oscar system integrations (via function calling) to get the right answer.

Section 1: Indexing the Knowledge Base for fast retrieval (Step 0)

For many inquiries, the core source of truth resides in extensive, relatively static data sources. These include:

  • PDF documents: Preventative guidelines, Evidence of Coverage (EOC), and Schedule of Benefits (SOB), which define the general structure and policies of a member’s plan. Note that some of these documents are plan-specific, so we need to index all the variants and use the right one for the member when looking up information. Our systems have API calls that dynamically return the correct documents.

  • Taxonomy trees: These comprise codes and associated descriptions, like CPT (Current Procedural Terminology), HCPCS (Healthcare Common Procedure Coding System).

The volume and complex structure of these documents and code sets make passing them in as part of the LLM context impractical. So we index static data ahead of time. This process utilizes OpenAI embedding models and Retrieval-Augmented Generation (RAG) techniques to enable semantic search across these documents and trees. This allows our system to find the pages, sections, or codes pertinent to a given question. This index is regularly rebuilt to incorporate any updates to the source documents, ensuring our knowledge remains current.
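The indexing step above can be sketched as an offline embed-and-rank loop. The real pipeline uses OpenAI embedding models and much larger documents; in this runnable toy, a bag-of-words vector stands in for the embedding, and the document snippets are invented.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Stand-in for an embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Step 0 (offline): index plan-document snippets ahead of time.
# Rebuilt whenever the source documents change.
index = [
    ("SOB p.4",  "specialist visit copay 50 dollars after deductible"),
    ("EOC p.17", "prior authorization required for imaging services"),
]
vectors = [(ref, embed(text)) for ref, text in index]

def retrieve(question: str, k: int = 1) -> list[str]:
    # At query time: embed the question, rank snippets by similarity.
    q = embed(question)
    ranked = sorted(vectors, key=lambda rv: cosine(q, rv[1]), reverse=True)
    return [ref for ref, _ in ranked[:k]]

print(retrieve("do I need prior authorization for an MRI?"))  # ['EOC p.17']
```

The retrieved references are what later become the citations attached to the final answer.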

Section 2: Initial Understanding – Intent Classification (Step 1)

As depicted in the diagram's initial input flow, once a member's question arrives at the Superagent, the first critical step is Intent Classification.

We leverage an LLM to classify the member's question. This classifier precisely determines:

  • Whether the question is related to benefits and falls within our answerable scope.

  • The specific sub-intent it maps to (e.g., pharmacy, billing, referrals, general guidelines).

If the initial safety checks determine the input cannot be handled or the intent does not match one of our specified intents, the system escalates the query to a human Care Guide.
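One practical detail of this step is validating the classifier's structured output before acting on it. The sketch below shows one plausible shape for that check; the JSON schema, sub-intent list, and function names are simplified stand-ins, not Oscar's real taxonomy or API.

```python
import json

# Illustrative sub-intent list (the real taxonomy is Oscar-internal).
SUB_INTENTS = {"pharmacy", "billing", "referrals", "general_guidelines"}

def parse_classifier_output(raw: str) -> dict:
    """Validate the LLM classifier's JSON verdict; escalate on anything unexpected."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return {"escalate": True, "reason": "unparseable classifier output"}
    if not verdict.get("in_scope") or verdict.get("sub_intent") not in SUB_INTENTS:
        return {"escalate": True, "reason": "out of scope or unknown intent"}
    return {"escalate": False, "sub_intent": verdict["sub_intent"]}

print(parse_classifier_output('{"in_scope": true, "sub_intent": "pharmacy"}'))
print(parse_classifier_output('{"in_scope": true, "sub_intent": "weather"}'))
```

Treating any unrecognized or malformed verdict as an escalation keeps the failure mode safe: the member reaches a human rather than receiving a mis-routed answer.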

Section 3: Parallel Expertise – Sub-Agent Execution (Step 2)

Once the question's intent is classified and confirmed as answerable, a group of specialized sub-agents are triggered to execute in parallel. This parallel execution significantly reduces latency. Each sub-agent is designed to be an expert on a specific data source or question type, leveraging LLMs to retrieve, summarize, and interpret data relevant to the user's question.

Here is a detailed look at each sub-agent:

  • Provider Status (ProviderLookupAgent): This agent detects any mention of a provider and determines whether a specific healthcare provider or facility is in-network. It calls Oscar's live Provider Services to identify the relevant facility and provider information. This provider information is also intelligently forwarded to downstream agents, such as the CPT and HCPCS agents, to facilitate the calculation of the most accurate cost share based on the mentioned location (e.g., inpatient vs. outpatient classification).

  • Benefit Category Directory (BenefitCategoryAgent): This agent clarifies which medical interventions and services are covered under a member's health plan. It organizes information by benefit group, specific intervention, and location according to a benefit taxonomy maintained by Oscar. The BenefitCategoryAgent, with the help of the LLM, intelligently predicts the most relevant intervention, requests location if necessary, and retrieves real-time cost-sharing information via the Benefit service. This is an Oscar API that can adjudicate shadow claims in real time in our claims system, so it will always act identically to an actual incoming claim.

  • Benefit Documents PDFs (BenefitDocumentAgent): This agent serves as a comprehensive reference for health insurance documents: the Schedule of Benefits and the Evidence of Coverage. It utilizes RAG over these documents to identify the most relevant sections specific to the member's plan. This agent extracts answers exclusively from these official documents, providing direct citations to guarantee accuracy and build trust. If a citation from this agent is used in the final answer, a web link to the cited part of the document is provided.

  • General Guidelines (GeneralGuidelinesAgent): For broad inquiries that do not fit into a highly specific category, this agent provides answers based on general health plan rules. This agent typically handles information applicable across all plans and incorporates essential static rules that ensure consistent guidance.

  • Infer CPT codes (CPTAgent) & Infer HCPCS codes (HCPCSAgent): For inquiries involving medical procedures and durable medical equipment, these agents provide details such as CPT and HCPCS code(s) and the corresponding coverage rules, clarifying specific procedures or services and their coverage status.
    CPT codes are used to encode medical procedures and appear in medical claims to describe the procedures rendered. Oscar’s claim system is configured to assign coverage and cost share to CPT codes.
    The CPT and HCPCS agents, with the help of the LLM, identify the CPT and HCPCS code(s) that are likely relevant to the member’s question and then use Oscar’s backend services (Benefit Lookup and Cost Estimator Services) to get the cost share and coverage details such as prior authorization.

  • Perks (PerksAgent): This agent informs members about special perks and additional benefits available through their health plan, such as wellness programs or discounts. It also clarifies eligibility criteria for these valuable extras.

  • Pharmacy Drug Lookup (PharmacyDrugAgent): This agent provides clear details on prescription drug coverage. It identifies the relevant drug mentioned in the member's question and calls Oscar's live internal Drugs Service to obtain the most up-to-date and accurate drug cost and coverage information. It includes details on coverage, cost-sharing, and whether a specific medication requires step therapy or prior authorization. If a requested drug is not covered, it can also suggest covered variations or generic alternatives.

  • Preventive Guidelines (PreventativeGuidelinesAgent): This agent specializes in preventative care. It reviews our Preventative Care Guidelines to determine if a requested procedure qualifies as preventive. By cross-referencing a member's age and medical history, it provides a clear answer, complete with direct citations from the official guidelines for full transparency.

  • Prior Auth Contact Info (PriorAuthContactAgent): Certain services require prior approval. This agent simplifies this process by supplying precise contact information and clear instructions necessary to obtain prior authorization for specific services, thereby assisting members in navigating this crucial step within their health plan.

Each sub-agent runs independently. If a sub-agent is not confident in its own response (e.g., it finds no relevant information), it removes itself from the flow, ensuring that only high-confidence data moves forward.
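The fan-out-and-filter pattern described above can be sketched with a thread pool: every agent runs concurrently, and any agent that is not confident returns nothing and drops out. The two agent bodies below are invented placeholders that mirror the agent names from the list, not Oscar's real implementations.

```python
from concurrent.futures import ThreadPoolExecutor

def provider_lookup_agent(question: str):
    # Placeholder: the real agent calls Oscar's live Provider Services.
    if "dr." in question.lower():
        return {"agent": "ProviderLookupAgent",
                "answer": "Provider is in-network.",
                "source": "Provider Services"}
    return None  # not confident -> removes itself from the flow

def pharmacy_drug_agent(question: str):
    # Placeholder: the real agent calls Oscar's live Drugs Service.
    if "drug" in question.lower() or "prescription" in question.lower():
        return {"agent": "PharmacyDrugAgent",
                "answer": "Tier 1, $10 copay.",
                "source": "Drugs Service"}
    return None

AGENTS = [provider_lookup_agent, pharmacy_drug_agent]

def run_subagents(question: str) -> list[dict]:
    # Run all sub-agents in parallel, then keep only confident answers.
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        results = pool.map(lambda agent: agent(question), AGENTS)
    return [r for r in results if r is not None]

print(run_subagents("Is my prescription drug covered?"))
```

Because each agent is independent, total latency is roughly that of the slowest agent rather than the sum of all of them, and the synthesis step only ever sees high-confidence fragments.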

Real-time Oscar System Endpoints

A cornerstone of Superagent’s accuracy and real-time capability is its direct integration with Oscar's live operational systems. This is a significant differentiator, allowing us to access the most current and personalized member data. The blue boxes in the diagram represent calls to existing Oscar system endpoints:

  • Provider Status (via ProviderLookupAgent): This is a set of services that return real-time information about a provider or facility’s network status, i.e., whether they are in Oscar’s network, which specialties they are credentialed in, etc. The same services power provider verification during claims adjudication and member-facing provider search.

  • Benefit Service (via BenefitCategoryAgent): The Benefit Service is used to look up member benefit rules including co-pay/co-insurance, various exceptions, etc. The same service is used by our member site and claim adjudication pipeline.

  • Claim Service(s) (via CPTAgent/HCPCSAgent): This is the same service that is used to determine benefits for a CPT/HCPCS code on a claim line. The Superagent uses this same service to return the same result that a claim would get with the same code.

  • Pharmacy Drug Lookup (via PharmacyDrugAgent): These services return coverage and pricing information for drugs in Oscar’s formulary. The same services sit behind the member-facing pharmacy web pages and mobile app.

These real-time API calls ensure that information on a member's specific plan, current network status, and drug coverage is always up-to-date and accurate, reflecting any recent changes or personalized benefits.

Section 5: Orchestration & Refinement – Synthesis and Dialogue (Steps 3, 4, 5)

After the parallel execution of sub-agents, the system proceeds to synthesize the information.

  • Source Attribution (Step 3): Each successful sub-agent attaches clear citations to its output, linking back to the original data source. These citations are preserved throughout the process and in the final output for complete traceability and transparency. At this stage, a sub-agent can also, if necessary, identify a need for more information and ask a clarifying question to the member. For example, the cost share for a procedure often depends on where the procedure is performed. If the location cannot be clearly inferred from the question, the Superagent will ask the member about the location.

  • Final Synthesis (Step 4): A dedicated Synthesis Agent (another LLM call), takes the high-confidence outputs from all participating sub-agents. Its primary role is to formulate a single, coherent, and comprehensive answer, if there are no pending follow-up questions. This agent is also intelligent enough to identify inconsistencies or prompt for further clarification if needed, before generating a final response. The diagram illustrates this as "Synthesize remaining source-level answers into final answer 'package'."

  • Multi-Turn Dialogue Support (Step 5): The system is designed for natural conversation. If the original question lacks necessary detail, either a sub-agent (as mentioned above) or the Synthesis Agent will prompt the user for clarification. Once the user provides the additional context, the relevant agents (and the synthesizer) re-run with this new information to refine the output, ensuring a precise and complete answer. The diagram shows the "Vibe-check follow-up question" loops.

Ultimately, after all clarifications and syntheses, the Superagent determines its confidence in the answer package. If confidence is high, the answer is returned to the member; otherwise, it is escalated to a human Care Guide, ensuring members always receive reliable assistance.
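The clarify-and-rerun loop from Steps 3-5 can be sketched as follows. The agent, cost-share values, and follow-up logic here are invented for illustration; in the real system the clarifying question goes back to the member and the relevant agents and synthesizer re-run with the merged context.

```python
def benefit_agent(context: dict) -> dict:
    # Placeholder agent: cost share depends on where the procedure is performed.
    if "location" not in context:
        # Missing detail -> emit a clarifying question instead of an answer.
        return {"follow_up": "Where will the procedure be performed?"}
    share = {"outpatient": "a $150 copay", "inpatient": "20% coinsurance"}
    return {"answer": f"Your cost share is {share[context['location']]}.",
            "source": "Benefit Service"}

def converse(context: dict, member_replies: list[str]) -> dict:
    # Re-run the agent each time the member supplies missing context.
    replies = iter(member_replies)
    while True:
        result = benefit_agent(context)
        if "follow_up" not in result:
            return result
        context["location"] = next(replies)  # member answers the clarifying question

print(converse({"procedure": "MRI"}, ["outpatient"]))
```

The loop terminates either with a cited answer or, in the real system, with an escalation to a human Care Guide when confidence stays low.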

What’s next

With the Superagent already answering members’ benefit questions directly, we are now adding a new use case every two weeks or so, from taking real-world actions such as issuing new ID cards to explaining physician choices better. We are also extending the Superagent to voice: in the near future, members calling in will be able to converse directly with the bot.

We’re also excited about the evolution of the underlying models. With the launch of GPT-5, we’ve already begun experimenting with using it to code some of our new features, and look forward to using it to unlock Superagent functionality going forward.

We believe information across the entire US healthcare system should be aided by and accessible through AI. The opportunity to reduce costs and increase quality is massive.

A huge thank you to all the team members who helped move this project forward: Bebe Silberzweig, Gina Yoo, Taz Zorrilla, Gabriela Lawrowska, Herry Pierre-Louis, and Alex Miller.

Interested in building tools like Superagent? Check out our open roles.
