Call Summarization: comparing AI and human work

By Lei Zhao

Summarization is considered a top use case for generative AI. Our call quality team ran a side-by-side evaluation on real member calls, and the results are shared below. The initial findings show that AI performs comparably to our Care Guides in summarizing calls overall, but this performance isn’t evenly distributed. Although the AI summaries were comprehensive, accurately identifying key information was sometimes a challenge.

Background

Currently, Care Guides manually summarize ~1 million calls from members annually, documenting each one in service call notes. This documentation is required to establish context for downstream users like other Care Guides, auditors, and operational teams handling escalated issues; inadequate or missing documentation can make it harder for those parties to efficiently resolve member issues. This manual work is time-consuming, taking several minutes per call — and notably longer for more complex calls. Our AI call summarization aims to minimize the manual summarization burden on Care Guides, both in terms of time spent and cognitive load, so that they can spend more time supporting our members. A similar solution can be deployed across a range of other front-line teams, both clinical and non-clinical, with Care Guides as an important proof point.

Illustrative high-level overview of manual vs. AI-enabled note-taking workflows. We anticipate that note taking for a Care Guide’s own tracking purposes during a call will continue regardless of AI summaries.

Findings

In the prototype phase, we collaborated with our Quality Improvement team to devise a grading rubric they used to assess AI- and Care Guide-generated summaries, for an “apples to apples” comparison. (The Quality Improvement team regularly samples and audits calls.) Both versions of the summaries were graded on these four criteria:

  • Accuracy: the summary accurately captured the facts of the call

  • Relevance: the summary did not inject spurious, irrelevant topics/points of discussion

  • Clarity: the summary produced was readable and easily understandable

  • Completeness: the summary captured the major points of the call

While our Quality Improvement auditors weighted each of these categories equally, we were particularly interested in the accuracy and relevance of the AI summaries, i.e., whether the AI got the facts right and did not make things up.

Using this rubric across a sample set of calls, we compared the two versions in terms of errors per summary (total errors divided by total calls); the results are summarized below.
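As a reference for how this metric works, here is a minimal sketch of the errors-per-summary comparison in Python. The rubric categories match the ones above, but the error counts and call totals are placeholders for illustration only, not our actual evaluation numbers.

```python
from dataclasses import dataclass

CATEGORIES = ["accuracy", "relevance", "clarity", "completeness"]

@dataclass
class GradedSample:
    """Total graded errors per rubric category across a sample of summarized calls."""
    total_calls: int
    errors: dict  # category -> total error count across all graded summaries

def errors_per_summary(sample: GradedSample) -> dict:
    """Errors per summary for each category: total errors divided by total calls."""
    return {c: sample.errors[c] / sample.total_calls for c in CATEGORIES}

def relative_difference(ai: GradedSample, human: GradedSample) -> float:
    """Percent difference in the overall error rate (all categories pooled), AI vs. human."""
    ai_rate = sum(ai.errors.values()) / ai.total_calls
    human_rate = sum(human.errors.values()) / human.total_calls
    return (ai_rate - human_rate) / human_rate * 100

# Placeholder numbers purely for illustration -- not the evaluation's actual results.
ai = GradedSample(100, {"accuracy": 12, "relevance": 5, "clarity": 4, "completeness": 3})
human = GradedSample(100, {"accuracy": 8, "relevance": 4, "clarity": 3, "completeness": 11})

print(errors_per_summary(ai))
print(f"{relative_difference(ai, human):+.2f}% errors vs. human summaries")
```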

Takeaways

AI summaries are arguably achieving parity with Care Guide summaries.

  • AI versions had 8.24% fewer errors than Care Guides across all four categories

AI performance isn’t better than Care Guide summaries across the board.

  • AI outperformed Care Guides in the completeness category

  • AI underperformed Care Guide summaries in accuracy, and slightly underperformed in relevance and clarity.

AI summaries did not contain hallucinations (instances where an AI model generates incorrect or made-up information but presents it as fact).

We analyzed the accuracy underperformance noted above by reviewing both the transcripts fed into the AI summarization service and the call audio, and found the following underlying causes for accuracy issues:

  • Poor source transcript quality, particularly for Spanish speakers.
    - This was especially true when an interpreter was involved between an English-speaking Care Guide and a Spanish-speaking caller.
    - Idiosyncrasies in Spanish also caused transcription issues, such as when a caller spelled out the doctor’s name “Luaces” as “L-U-A-C de casa-E-S.” Here “C de casa” is the equivalent of saying “N as in Nancy” in English, but transcription services had trouble handling it.
    - This is a good example of why models always need to be viewed in context: the potential bias in the model chain doesn’t come from the summarization model, it emanates from the transcription model.

  • Mispronunciations of doctors’ names by callers or agents that were “accurately” transcribed and then ingested as transcribed into the summaries. Mispronunciations like this can be a challenge even for our Care Guides.
    - For example, a caller said their doctor’s name was “Abu,” when it was in fact “Nabut.”

  • Another interesting finding is that agents’ summaries still tended to be better at capturing actual entity names (like physician or hospital names). This is an almost accidental consequence of their workflow: agents pull up physicians, hospitals, drugs, and other entities in our tools while talking to the member, and then copy and paste the names into their notes. (We have all of that clickstream data and could make it part of automatic note-taking; a sketch of this idea follows the list below.)

  • Generic summaries didn’t pick up on details like payment amounts, such as when a Care Guide informed a caller that the expected copay for a visit would be $10. This is an example of further quality improvement that could come from, say, tuning an out-of-the-box LLM on the details that matter in a given context (a lightweight check for this kind of omission is also sketched after this list).
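To illustrate the clickstream idea above: the entities an agent pulls up during a call could be folded directly into the summarization prompt, so the model uses our canonical spellings rather than whatever the transcript contains. The sketch below is a hypothetical illustration of that approach, not our production pipeline; the LookupEvent structure and the prompt wording are assumptions.

```python
from dataclasses import dataclass

@dataclass
class LookupEvent:
    """Hypothetical clickstream record: an entity the agent pulled up during the call."""
    entity_type: str      # e.g. "physician", "hospital", "drug"
    canonical_name: str   # the name exactly as it appears in our tools

def build_summary_prompt(transcript: str, lookups: list[LookupEvent]) -> str:
    """Build a summarization prompt that grounds entity names in clickstream lookups,
    so the summary spells names the way our systems do instead of trusting the transcript."""
    entity_lines = "\n".join(f"- {e.entity_type}: {e.canonical_name}" for e in lookups)
    return (
        "Summarize the member call transcript below for service call notes.\n"
        "When referring to any entity the agent looked up during the call, use these "
        "canonical names exactly as written:\n"
        f"{entity_lines}\n\n"
        f"Transcript:\n{transcript}"
    )

# Example usage with made-up lookups.
prompt = build_summary_prompt(
    transcript="...full call transcript...",
    lookups=[
        LookupEvent("physician", "Dr. Luaces"),
        LookupEvent("drug", "atorvastatin 20 mg"),
    ],
)
```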
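Tuning is one way to capture details like the $10 copay; a complementary, much lighter-weight option — sketched here under our own assumptions and not something described in the evaluation — is an automated check that flags dollar amounts present in the transcript but missing from the summary.

```python
import re

# Matches dollar amounts such as "$10", "$1,250", or "$10.50".
DOLLAR_AMOUNT = re.compile(r"\$\d[\d,]*(?:\.\d{2})?")

def missing_amounts(transcript: str, summary: str) -> list[str]:
    """Return dollar amounts that appear in the transcript but not in the summary,
    e.g. a copay quoted to the caller that the generated summary dropped."""
    in_transcript = set(DOLLAR_AMOUNT.findall(transcript))
    in_summary = set(DOLLAR_AMOUNT.findall(summary))
    return sorted(in_transcript - in_summary)

# Illustrative example based on the copay scenario above.
transcript = "Your expected copay for that visit will be $10."
summary = "Member asked about the cost of an upcoming visit; Care Guide confirmed it is covered."
print(missing_amounts(transcript, summary))  # ['$10']
```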

We are deploying improved transcription (obviously an area where the state of the art has advanced rapidly over just the past few months), and are working toward rolling out automated documentation functionality across a range of teams.
