How to Evaluate the Quality of LLM-based Chatbots
Strategies for programmatic and qualitative evaluation of chatbots built with GPT and LlamaIndex
Why I care about evaluating chatbots
As a part of Buildspace Nights and Weekends, I’m currently working on exploring ways to reliably improve the performance of data-supported chatbots. I’m attempting to improve the performance of one particular application, my Legal Tech Bot, in order to find strategies that might be more broadly applicable.
First, a few clarifications. I’m focused on chatbots built on top of LLMs / GPT. When I say data-connected, I am referring to chatbots which have a means of passing context to GPT from a dataset of your choosing. In my case, that is a large set of blog posts and articles. LlamaIndex is a popular tool for this purpose.
I’ve previously outlined four types of questions that I think data-supported chatbots struggle with. Briefly, these are:
- Questions that need to provide the most recent answer from the data
- Subjective questions
- Generic / high level questions
- Questions that require aggregation of facts from the data
I plan to tackle each of these issues one by one with the goal of finding a reliable way to mitigate them. But before I set out, I needed to answer one question: how will I evaluate whether my bot is getting better?
I spent the last few weeks exploring this question. The purpose of this post is to share what I’ve learned about evaluating the output of data-supported chatbots built with LlamaIndex.
I’ll first share some high level information on the variety of ways we can think about qualitative and programmatic evaluation of chatbots. I will then share details on how I’ve chosen to go about evaluation for my project.
Everything I’ll discuss below will be from the context of the Legal Tech Bot, which is built with LlamaIndex and GPT. This post will be most useful for anyone using that stack, but will also cover general considerations that should be useful for anyone working on chat based products.
Before I dive into exploring how to evaluate responses, I’ll remind the reader that a chatbot built with LlamaIndex and GPT works something like this: an engineer gathers a set of documents they want to use as a reference, and LlamaIndex creates a quick way to search through those documents. When a user asks a question, LlamaIndex tries to find the most relevant context across all of the source documents. It then passes both the context and the question to GPT, which generates a final response.
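To make that flow concrete, here is a toy sketch of the retrieve-then-generate pattern in Python. To be clear, this is not LlamaIndex’s actual API: retrieval is naive keyword overlap and the GPT call is a stub, both purely for illustration.

```python
# Toy sketch of the retrieve-then-generate pattern described above.
# Real systems use embedding-based retrieval (e.g. via LlamaIndex) and a
# real LLM call; both are stubbed out here.

def retrieve_context(question: str, documents: list[str]) -> str:
    """Naive retrieval: return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(documents, key=lambda doc: len(q_words & set(doc.lower().split())))

def ask_llm(question: str, context: str) -> str:
    """Stand-in for a GPT call that answers the question using the retrieved context."""
    return f"Based on: '{context}', here is an answer to: '{question}'"

documents = [
    "Acme Legal raised a $10M Series A in 2023.",
    "Contract review tools increasingly use LLMs.",
]
question = "How much did Acme Legal raise?"
context = retrieve_context(question, documents)
response = ask_llm(question, context)
```

The key point is that the final answer quality depends on two separate steps: whether the right context was retrieved, and whether the LLM used it well. Evaluation has to account for both.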
Chatbot evaluation at a high level
What is a good response?
My Legal Tech Bot is designed to be a sort of guru for the industry. It should be able to intelligently answer questions relating to industry trends, funding rounds, specific products, and just about anything else one might ask about the industry.
I chose this space because I knew I could get good data to support my bot and the industry size is relatively constrained. But more importantly, I know enough about it to easily evaluate answers. Playing with the first version of the bot, I was quickly able to tell what answers were good or bad and to identify predictable ways in which the bot wasn’t so good. As most people building in this space will do, I formed my first judgements of performance based on intuition. I call this qualitative evaluation.
How is qualitative evaluation done?
At a high level, we can evaluate output either qualitatively or programmatically. In traditional machine learning, one would focus exclusively on the latter. But when it comes to generative machine learning, programmatic methods may not offer a dramatic advantage. I’ll touch more on this later. For now, we can consider two qualitative strategies.
Strategy one: form an intuitive opinion
Intuitive evaluation is simply done by asking a lot of questions and getting a feel for whether your bot’s responses tend to be good or not. You could take this a step further by asking a lot of questions and manually tracking the rate of what you judge to be ‘good’ responses. This doesn’t sound very scientific, but we’ll later see that programmatic methods are effectively doing the same thing. If you are building a chatbot, you’ll form this intuition anyway. So I think it’s worth paying attention to how your internal rating changes over time.
The reason you can’t exclusively rely on this is that it’s too easy to lie to yourself. Any chatbot will do better with certain types of questions. It’s likely that you’ll find yourself asking more of that type, and less of the types it isn’t so good at answering. Given that, you’ll form an overly generous evaluation of the bot’s quality. The only solution is to find users who are well suited to evaluate your subject matter, and get the product in their hands. Even better if they can send you examples of use. They’ll undoubtedly show you new ways in which your bot is good and bad. Even better, they’ll teach you about what types of questions someone who isn’t you would want to ask. You can start doing this with just a few beta users. If you have a live app with more users, you can take it one step further.
Strategy two: thumbs up
If you’re building a chatbot and gaining users, it makes sense to set up a way to get input from others. If you use ChatGPT, you’ve probably noticed the little thumbs up / thumbs down icons next to each response that allow users to report whether the answer was helpful or not.
If you were to pick one key metric, this is a great candidate. But again, I wouldn’t suggest relying on it exclusively. There’s lots of bias to pick apart when interpreting what types of answers would lead a user to give a thumbs up or thumbs down. More direct signals of usefulness might be number of questions per user per session or average session duration.
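If you do collect this kind of feedback, the metrics are simple aggregates. Here is a minimal sketch, assuming a hypothetical event log with one entry per answered question:

```python
# Hypothetical feedback log: one dict per answered question.
events = [
    {"session": "s1", "feedback": "up"},
    {"session": "s1", "feedback": None},   # user gave no rating
    {"session": "s1", "feedback": "down"},
    {"session": "s2", "feedback": "up"},
]

# Thumbs-up rate, computed only over answers the user actually rated.
rated = [e for e in events if e["feedback"] is not None]
thumbs_up_rate = sum(e["feedback"] == "up" for e in rated) / len(rated)

# A more direct usage signal: questions asked per session.
sessions = {e["session"] for e in events}
questions_per_session = len(events) / len(sessions)
```

Note the denominator choice: most users rate nothing, so the thumbs-up rate measures only the opinionated minority, which is one of the biases worth picking apart.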
How is programmatic evaluation done?
Coming from a more traditional analytics and ML background, I wondered early on how I might evaluate quality programmatically. I wasn’t sure it was even possible. It certainly wouldn’t be as black and white as evaluation can be in traditional machine learning where one can rely on well established metrics like accuracy and precision.
In the case of traditional machine learning, we’re asking a program to evaluate a fairly simple question: does the predicted label match the true label? With generative AI, it’s not entirely clear exactly what question we want a program to evaluate for us. For a chatbot, I think it’s something like this: given the context, is the chatbot’s response a fair answer to the user’s question? Yes or no.
A fair programmatic estimate of quality would be to ask a bunch of questions and keep track of what percent of responses pass this test. It’s not a perfect strategy, however.
Programmatic evaluation is itself highly subjective. For example, let’s look at the test I outlined above: given the context, is the chatbot’s response a fair answer to the question, yes or no? Well, what context? Do we judge this based on the context that was ultimately passed to the bot by LlamaIndex? Or do we evaluate it based on the entire index and all context that could have been passed? Here are some other questions that illustrate how many different ways we might evaluate a response:
- How is the judgement made as to whether the answer makes sense given the question and context?
- Do we want to consider the context? Might we want our bot to also be able to use knowledge from outside of the context?
- Do we want to consider the question during evaluation? If we want to measure how much the bot is making things up we might just care about the response in relation to the context.
- How will we generate questions to run this test on? Do we need to worry about creating a biased set?
All of these questions are up to the engineer. There are no right answers, although I suspect we’ll find some standards over time. A curious programmer might experiment with writing code to do these evaluations on their own. This could be a great exercise, but thankfully, it’s not a required one. LlamaIndex provides us with a set of tools to explore all of these options, which I elaborate on below.
Details on how I’m evaluating chatbot output
So far I’ve outlined high-level ideas for how to evaluate chatbot output. Now I’ll share how I’m planning to do it. In short, I’m going to rely on both a programmatic baseline and targeted intuitive check-ins. I’ll run both of these analyses before and after each problem I tackle.
Programmatic Evaluation Process
Rather than walking through full implementation details, I’m going to explain the high-level usage pattern I’m following, given the evaluation tools that LlamaIndex provides. If you’re interested in specifics, I suggest checking out LlamaIndex’s great documentation, GitHub, and Discord for support.
LlamaIndex provides at least four different ways to evaluate responses, all of which could be mixed and matched. I’m using what LlamaIndex calls Binary Evaluation of the query, response, and source context. The docs state that, “this mode of evaluation will return “YES”/”NO” if the synthesized response matches the query + any source context.”
As I stated earlier, what I’m really interested in is this: for questions which my context should be able to answer, does the response provided by my bot make sense? This mode of evaluation gets us close to that. However we still need to make sure the questions we run this test on are questions that could, in theory, be answered by our source documents. One way to do this would be to run the eval as described above, and then for each question that fails, check every chunk of text in all the source documents to see if it could have answered the question. This would be very slow. LlamaIndex provides a better solution: a DatasetGenerator class. This class allows you to generate questions from your context. Because they’re generated directly from your data, we can reasonably assume they should be answerable by the context.
There’s one issue with relying on these questions for testing: they’re too good. Because they’re generated directly from the context, our bot may perform deceptively well on them. Users might ask questions that are generic, unexpected, or obscure. To solve this, I created a second test set by simply giving GPT-4 some context about my bot and asking it to generate a list of questions.
So given all of that, my evaluation process looks like this:
- Generate 100+ questions directly from my dataset using LlamaIndex’s dataset generator.
- Set up a loop to run the YES / NO test defined above on each of these questions. Track the success rate defined by [# of YES] / [Total Questions]. This is the score I’ll use to evaluate performance.
- Generate 100+ questions just using GPT-4.
- Repeat step 2 on these questions.
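Steps 2 and 4 boil down to a loop like the sketch below. Here `query_bot` and `evaluate` are stand-ins I made up for illustration; in practice they would wrap the LlamaIndex query engine and its binary response evaluator.

```python
import csv

def run_eval(questions, query_bot, evaluate):
    """Run the binary YES/NO eval over a question set, recording one row per question.

    query_bot(question) -> (response, context) and
    evaluate(question, response, context) -> "YES" or "NO"
    are stand-ins for the real LlamaIndex query engine and evaluator.
    """
    rows = []
    for question in questions:
        response, context = query_bot(question)
        verdict = evaluate(question, response, context)
        rows.append({"question": question, "response": response,
                     "context": context, "evaluation": verdict})
    score = sum(r["evaluation"] == "YES" for r in rows) / len(rows)
    return score, rows

# Toy stand-ins so the sketch runs end to end.
def query_bot(q):
    return (f"answer to {q}", f"context for {q}")

def evaluate(q, response, context):
    return "YES" if "funding" not in q else "NO"

questions = ["What is legal tech?", "Who leads funding?", "Name a product."]
score, rows = run_eval(questions, query_bot, evaluate)

# Persist for manual review (I use a Google Sheet; a CSV is shown here).
with open("eval_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "response", "context", "evaluation"])
    writer.writeheader()
    writer.writerows(rows)
```

The returned `score` is the [# of YES] / [Total Questions] metric from step 2, and the saved rows are the four-column record described below.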
Inside each of these loops I am also building a data frame with four columns: question, response, context (the source context found by my index), and evaluation (yes/no), with one row per question. That way I can review things visually. I save this data frame to a Google Sheet to have on hand for future reference. Again, my core metric will be the percentage of questions that get a YES evaluation, which is to say the percentage of responses that match the query + source context. In my first test, the LlamaIndex-generated questions scored 78% and the GPT-generated questions scored 79%. I’ll be running this after every major change to my bot.
There are many imperfections in this process. For example, I would actually like my bot to take a stab at answering generic legal tech questions that are not directly found in my source data. I’ve found GPT-4 does well on a question like “how is AI used in legal tech?”, which may be too generic for the index to find a good piece of context for. Even with these imperfections, these tools prove extremely useful. I now have a means of running a programmatic evaluation of the quality of my bot. The goal is to drive this score up. Equally important, this score serves as something to triangulate my intuitive opinion of my bot’s performance against.
There is one big issue with this process: each question evaluated requires an API call to your chosen LLM. This means evaluation is slow and expensive, not to mention the initial token-intensive process of generating questions. Hence the small number (100) of questions I suggest using above. If you are following this process for a production-level product, I would suggest running your evaluation on far more questions. 100 should be enough to provide a meaningful estimate of quality for smaller projects. But imagine one generates 1,000 questions and breaks them into 10 sets of 100. The evaluation score for each of these sets would vary, and it’s hard to say by how much. I’m hoping that clever engineering can solve this in the future, which is one more reason I think it’s wise to utilize tools such as LlamaIndex instead of building data-connection tools yourself.
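We can actually put a rough number on that variation: if each question is scored independently, the spread of a YES-rate over n questions follows the binomial standard error. A quick sketch using my observed 78% score:

```python
import math

def score_standard_error(p: float, n: int) -> float:
    """Standard error of an observed YES-rate p over n independently scored questions."""
    return math.sqrt(p * (1 - p) / n)

se_100 = score_standard_error(0.78, 100)    # ~0.041
se_1000 = score_standard_error(0.78, 1000)  # ~0.013, tighter with more questions
```

At n = 100 that works out to roughly plus or minus 8 points at 95% confidence, so two runs of 100 questions can differ by several points without reflecting any real change in the bot. (This assumes independent questions; a biased question set would add error on top of this.)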
Qualitative Evaluation Process
I previously outlined four specific types of questions for which my Legal Tech Bot does not perform well. I also explained why I think these are general issues to be expected for data-supported bots built with tools like LlamaIndex. I plan to tackle one of these at a time. As I do, I’ll keep tabs on the programmatic evaluation. But I’ll also follow this process to qualitatively gauge performance:
- Before attempting to fix the issue, collect a variety of examples of it (for example, an inability to account for recency). Find these examples by trial and error. I also have a few test users who I’ll ask to help here; users love poking holes in your product.
- Go through the process of researching, experimenting with, and implementing a solution.
- Test the questions from part (1) again. Ask test users to do the same.
- Bonus: search the output of my programmatic evaluation to see if there are any examples of this problem which previously received a NO and are now receiving a YES.
I’ll follow this exact process for each problem I work on. Combining this with detailed before-and-after notes will give me a strong sense of whether my bot is improving. This is far from a perfect science, but so is the programmatic approach, which effectively offloads the judgement onto GPT. The qualitative approach is an important way of detecting changes in performance that might be missed by programmatic evaluation.
Building high-quality LLM-based chatbots is still very much a dark art. Connecting them to your data dramatically improves their ability to provide factual information. But it’s not yet clear how we should judge the quality of a bot. This article is intended to share my thought process and the solutions I’ve landed on, which I believe make the most sense given current tools. If you have any questions or thoughts on how to better evaluate data-supported chatbots, please reach out; I’d love to talk.