Evaluating LLM Summarization Quality

 

Using LLMs for summarization

  • Summarization is one of the most valuable and practical use cases of LLMs
  • In the simple case, it’s as easy as passing a block of text to the LLM and prompting it to return a summary
  • But in production, things become more complicated
  • For example, how should you summarize large documents? Imagine you want to provide summaries of 20+ page documents for users.
  • There are a few common ways to go about this (a rough sketch of the Map Reduce approach follows this list):
    • “Stuff”: just use an LLM with a large context window and stuff all the content in there.
    • “Map Reduce”: break the large document into pieces, summarize each piece separately, and then have the model combine the summaries.
    • “Refine”: break the document into pieces. Summarize the first chunk. Then summarize the summary so far plus the next chunk. Continue until you’ve iteratively built a summary of the whole document.
  • But how do you know which method is best? I was surprised to find that there is no agreed upon best practice. I couldn’t even find any research evaluating the relative performance of these strategies on large documents.
  • Not to mention:
    • Does the model matter? For example, does GPT-4 produce meaningfully better summaries than GPT-3.5?
    • How large a document can the model handle before the summary quality starts to deteriorate?
    • Is the model good at summarizing the type of content I need to work with anyway?
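
To make the strategies concrete, here is a rough sketch of the Map Reduce approach using the openai Python client. The chunk size, prompts, and model name are placeholder assumptions, not recommendations:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize(text, model="gpt-3.5-turbo"):
    # One LLM call that summarizes a single block of text
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": "Summarize the following text concisely."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

def map_reduce_summary(document, chunk_size=8000):
    # "Map": split the document into chunks and summarize each chunk separately.
    # Naive character-based chunking; a real system would split on tokens or sections.
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    partial_summaries = [summarize(chunk) for chunk in chunks]

    # "Reduce": combine the partial summaries into one final summary.
    combined = "\n\n".join(partial_summaries)
    return summarize("Combine these partial summaries into a single summary:\n\n" + combined)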

 

How do we evaluate summary quality?

  • We want to be able to test different approaches and compare their results. This leads to a more general question: how should you evaluate the quality of an LLM-generated summary?
  • Historically, there have been two common ways of evaluating summarization tasks.

 

ROUGE

  • ROUGE is a scoring method that evaluates overlap between an AI-generated summary and a source-of-truth summary.
  • It looks for the presence of exact words and sequences of words (n-grams).
  • Intuitively, this does not align very well with human judgment.
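
If you want a ROUGE baseline anyway, one quick way to compute it is with the rouge-score package; the two strings below are made-up examples:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The quarterly report shows revenue grew 10% year over year."  # source-of-truth summary
candidate = "Revenue increased ten percent compared to last year."         # AI-generated summary

scores = scorer.score(reference, candidate)  # precision / recall / f-measure per ROUGE type
print(scores["rougeL"].fmeasure)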

 

BERTScore

  • Evaluates the similarity of the generated summary and the source-of-truth summary using embeddings.
  • An improvement over ROUGE, since it can capture semantic similarity rather than just exact word overlap.
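
For comparison, here is roughly how you would compute BERTScore with the bert-score package (same made-up example pair as above):

from bert_score import score

candidates = ["Revenue increased ten percent compared to last year."]
references = ["The quarterly report shows revenue grew 10% year over year."]

# Returns precision, recall, and F1 tensors based on token embedding similarity
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())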

 

Using the model

  • But it seems there’s a better way… just use the model to evaluate itself.
  • This seems hacky on first impression, but intuitively it makes sense that it might work well. An LLM works with embeddings, but it also has a huge amount of knowledge about the world encoded in its weights.

 

G-Eval

  • The G-Eval paper shows that LLM-powered evaluations have a higher correlation with human judgment than previous methods.
  • This OpenAI cookbook shares a prompt you can use to get the model to evaluate:
    • Relevance (1-5): Inclusion of important info from the source
    • Coherence (1-5): Overall quality of sentences and structure
    • Consistency (1-5): Factual alignment of summary and source
    • Fluency (1-3): Quality in terms of word choice, grammar, spelling, etc.

Here’s a fragment of that prompt:

# Metric 1: Relevance

RELEVANCY_SCORE_CRITERIA = """
Relevance(1-5) - selection of important content from the source. \
The summary should include only important information from the source document. \
Annotators were instructed to penalize summaries which contained redundancies and excess information.
"""

RELEVANCY_SCORE_STEPS = """
1. Read the summary and the source document carefully.
2. Compare the summary to the source document and identify the main points of the article.
3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.
4. Assign a relevance score from 1 to 5.
"""

 

Using G-Eval Part 1: Dataset

  • Given the example from OpenAI above, it’s easy enough to set up some code and run G-Eval on a given datapoint. But for any sort of useful testing, you’re going to need more than that. Step one is to set up a dataset.
  • A unique thing about G-Eval is that it does not require a source-of-truth summary. Instead of needing a dataset of Text / Text Summary pairs, you can simply use a dataset of text you’d like to summarize. You generate the summary using the method you want to test, and then pass the original body of text plus your generated summary to the eval.
  • That being said, I think it’s still better to use a test set with source-of-truth summaries if you can find one that is similar enough to your use case. Between Kaggle, Hugging Face, and Google, there are some great summarization datasets out there, typically used for fine-tuning models (see the example after this list). I was able to find one whose documents I felt were similar in content and structure to my use case, so I adjusted the system to generate a summary and then compare it to the source-of-truth summary.
  • There are many ways you could go about getting a test set. You probably don’t need more than ~20 datapoints to start. The evaluation process takes a long time given its dependence on API calls and requires some manual review, so smaller but higher quality may be better.
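
As an example of the source-of-truth route, you can pull a small slice of a public summarization dataset with the Hugging Face datasets library. CNN/DailyMail is used here only because it is well known; pick whatever is closest to your own content:

from datasets import load_dataset

# ~20 examples is plenty to start; each row has an "article" and a
# reference summary in "highlights"
test_set = load_dataset("cnn_dailymail", "3.0.0", split="test[:20]")

for row in test_set:
    document = row["article"]
    reference_summary = row["highlights"]
    # generate your own summary here and evaluate it against the reference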

 

Using G-Eval Part 2: Testing loop

  • I had an evaluation method and a dataset, but no further guidance on process. So I set mine up as follows:
  • I defined a few different combinations of model and summary method (e.g. GPT-4 + “stuff”) that I wanted to test. Then for each combination, I looped through my test set, generated a summary, and ran evals using the base summary and generated result.
  • More specifically, my code did the following (sketched in code after the output table below):
    • Go through each test doc, generate a summary
    • Use the generated summary and the source of truth to get the following metrics:
      • G-Eval Scores
        • Relevance, Coherence, Consistency, Fluency
      • Time to completion
      • Total tokens used
      • Input length
      • Output length
    • Append each result to a DataFrame, which I then saved as a CSV.

This outputs a table where each row is a summary with the following columns:

Relevance (1-5) | Coherence (1-5) | Consistency (1-5) | Fluency (1-3) | Duration (s) | Total Tokens | Input Characters | Output Characters | Summary Content
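
In code, the loop looked roughly like the sketch below. It reuses the helpers sketched earlier in this post (map_reduce_summary, get_geval_score, the RELEVANCY_SCORE_* strings, and the test_set slice) and simplifies the bookkeeping; token counting is omitted.

import time

import pandas as pd

results = []
for row in test_set:
    start = time.time()
    summary = map_reduce_summary(row["article"])  # or whichever method/model combo is being tested
    duration = time.time() - start

    results.append({
        "relevance": get_geval_score(RELEVANCY_SCORE_CRITERIA, RELEVANCY_SCORE_STEPS,
                                     row["article"], summary, "Relevance"),
        # ...repeat for Coherence, Consistency, and Fluency with their own criteria/steps...
        "duration_s": round(duration, 1),
        "input_chars": len(row["article"]),
        "output_chars": len(summary),
        "summary": summary,
    })

pd.DataFrame(results).to_csv("summary_eval_results.csv", index=False)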

 

How this is useful: insights

This type of analysis helps provide concrete insights before launching a product.

On the technical side you can figure out:

  • Which summary approach works best
  • Which model works best
  • What is latency like for your use case? Is streaming critical?

On the product side:

  • How do you need to adjust the prompt to get the style you want?
  • Is the response the right level of information density for your use case?
  • Does quality hold for the size of docs you expect to summarize?

Putting this information together leads to insights which should affect your configuration for production. Maybe you find GPT-3.5 works just fine for your use case and you can save money by not using GPT-4. Maybe you find the quality of your summaries is consistently low and your use case requires fine-tuning. Or maybe you realize the summaries are too sparse, and you choose to use something like a chain of density prompt.

 

Human in the loop

The last and most important part of the process is to have a human simply read through some of the summaries and compare them with the original docs or source-of-truth summaries.

Automatic evaluations are directionally useful. They can be very helpful in giving you a quick read on whether a change makes any difference in your system, or in helping you pick out certain types of docs which are not summarizing well. But they are infinitely more valuable when combined with just taking the time to read and analyze the output yourself.

In this process you may find that you actually prefer a certain type of summary that doesn’t score as well. Or you may notice the model often includes some additional chat content along with the summary.

 

Ship it

All of the data gathered in this process ultimately has little value when compared to user feedback. This process is useful for ensuring your product doesn’t totally suck before shipping it. But shipping it, talking to users, adding a feedback mechanism, storing outputs, etc. - these are the things that will really tell you what you need to change. Then you can try out a new solution (model, prompt, algorithm), use your evals as a quick gut check, ship again, and repeat.

 

 

References

- Main source of inspiration: OpenAI Eval Cookbook