The Issue with Data Supported Chatbots
The power of chatbots that can access your data
We’re in the midst of a surge of chat-based products, built by developers excited by the possibilities GPT has unleashed. But anyone who spends some time playing with ChatGPT quickly realizes that these models make things up.
One way to solve this is to show GPT some contextual info before asking it a question. But until recently, developers could only provide a very small amount of context due to API constraints. With GPT-4 this context size has expanded dramatically, but a user still couldn’t do something like pass GPT an entire book to reference. Even if they could, it wouldn’t be wise from a performance standpoint. It would take forever, and cost a lot, to get responses.
It would be nice to be able to connect GPT to a repository of data on a specific subject so that it can answer questions about that data accurately. It would be even better if we had a way to show GPT just the part of that context relevant to a specific question, so that it doesn’t have to search through it all. Thankfully, a few tools have been built for exactly this purpose. I’ll focus on one called LlamaIndex because it’s very popular and it’s what I use.
LlamaIndex allows you to ask GPT a question and have the model be shown the relevant info from your data before it answers.
If you ask GPT out of the box who won Best New Artist at the Grammys in 1984, it will likely make something up. But let’s say you collect all the Wikipedia entries on the Grammys and use LlamaIndex to connect them to GPT. Then, when you ask the question, LlamaIndex will usually be able to find the relevant section in your collection of Grammy Wikipedia pages and show it to GPT along with the question. Now it will provide you with the correct answer.
User: Who won best new artist at the 1984 Grammy's?
Assistant: The 1984 Grammy for Best New Artist was won by Cyndi Lauper
Convincing, but wrong.
(Excerpt from the retrieved Grammy Wikipedia section, listing 1984 winners such as Thriller – Michael Jackson, produced by Quincy Jones & Michael Jackson.)
User: Who won best new artist at the 1984 Grammy's?
Assistant: Culture Club won the award for Best New Artist at the 1984 Grammy's.
This connection between data and LLMs opens new doors. Applications might include customer support, documentation assistants for developers, or company knowledge bots. People have even used this to make bots that let them 'talk' to their favorite podcast hosts or authors.
The general pattern for building one of these is the same no matter what data you use:
- Collect a bunch of text about something (all transcripts of a podcast, all blog posts by an author, a book, etc)
- Use a tool like LlamaIndex which breaks this into chunks and creates a means of quickly finding chunks relevant to a given question. (some people do this step themselves)
- Feed those chunks and the question to GPT and get an answer.
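The three steps above can be sketched in plain Python. This is a toy stand-in, not LlamaIndex's actual implementation: real tools chunk on token counts and use embedding similarity, while here simple word overlap keeps the example self-contained, and the final prompt is printed rather than sent to the GPT API.

```python
# Toy sketch of the retrieve-then-answer pattern. Real tools like
# LlamaIndex use embeddings for similarity; word overlap stands in
# here so the example runs with no API key or dependencies.

def chunk_text(text: str, chunk_size: int = 50) -> list[str]:
    """Split text into chunks of roughly chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def score(question: str, chunk: str) -> int:
    """Count how many question words appear in the chunk."""
    return len(set(question.lower().split()) & set(chunk.lower().split()))

def retrieve(question: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks most relevant to the question."""
    return sorted(chunks, key=lambda c: score(question, c), reverse=True)[:top_k]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble the prompt that would be sent to GPT."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

corpus = (
    "Culture Club won Best New Artist at the 1984 Grammy Awards. "
    "Thriller by Michael Jackson won Album of the Year in 1984."
)
question = "Who won Best New Artist at the 1984 Grammys?"
chunks = chunk_text(corpus, chunk_size=10)
prompt = build_prompt(question, retrieve(question, chunks))
print(prompt)
```

With LlamaIndex itself, the same pattern collapses to a few lines — load your documents, build an index over them, and query it — which is why the tutorial below gets you running so quickly.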
It turns out this is very easy to get working if you have some Python experience. You can follow this tutorial by Dan Shipper and have a bot running in 30 minutes.
But once you start playing with one of these bots, something quickly becomes apparent: they are pretty bad. The more you play with one, the more you’ll see that they’re bad in all sorts of predictable ways. By predictable, I mean you can pick up on the types of questions it will give you crappy answers to. By crappy, I mean not what you would expect if you asked a human expert on the subject the same question. Because of that, these bots are largely considered fun toys rather than something you would use in a production application.
Making bots that know everything about something
In the process of making some of these bots, I’ve started to notice common ways in which they fail. Again, by fail I mean give bad answers. I’m going to base the rest of my writing here on a real project, which is helpful for my own exploration and for providing clearer explanation. I recently read a post by Bill Gates which included this passage:
That made sense to me. I don’t have access to enough of any one company’s data to try to build an agent like that. However, I wondered - if these could be built for certain companies, could they be built for entire industries? Imagine the type of person an investor or journalist might reach out to who is trusted to provide accurate information about an entire industry. A guru of sorts. The type of person you’d have to pay a lot of money per hour to ask a few questions.
Gates goes on
Could I nail that industry news part to start?
I decided to try to make what I’ll call the Legal Tech Guru. I chose legal tech for two reasons. First, I knew where I could get a bunch of data and blog posts about it. Second, I know the industry well (I work in it) and could easily evaluate the quality of responses. Here are some examples, in my mind, of questions a user should be able to ask the guru:
- How much funding does Company A have?
- Tell me about 5 companies who do ____
- What does Company XYZ do?
- How is AI being used in legal tech?
I want to mention that I’m not pursuing this project the way I would if my goal were just to make a product people wanted. I have no end user in mind. I’m not trying to get it in front of people to validate whether anyone wants a legal tech bot. I could imagine how it might be useful, but I started purely out of curiosity to see if it could be done. Then came the question: how do I make this thing good? That’s what I plan to explore. In doing so, it would be a wonderful side effect if I unearthed some answers to the more general question of how we make these types of things (data-backed, LLM-based chatbots) good.
I wasn’t planning to depend on myself for the answer to this question. But having looked all over online and chatted in Discords, I’ve found that the answers are just not well documented yet. Building a good chatbot, or ‘agent’ as it’s become popular to call them, is still very much a dark art.
The ways in which data supported chatbots suck
Having laid out what I’m doing, I’ll now explain some of the ways in which my bot, the Legal Tech Guru, sucks. It seems to me that the things that suck about the Legal Tech Guru right now are not unique to how I built it, but are to be expected by anyone who has built a v1 of an agent that relies on a similar set of documents.
Briefly, before diving into this, if I’m going to trash my own bot and say it’s bad, I should define what good would look like. To me, the definition is simple. For a bot to be good it should, when asked a question, respond in the way you’d expect a human who knows a great deal about the subject to answer. Ideally, it should even be better.
Problem One: the agent only responds accurately when questions are worded in a specific way.
Here is a real example where I’ve just changed out the company name.
Q: What does the company GoodLegalCo do?
A: It is not possible to answer this question with the provided context information.
Q: What is GoodLegalCo?
A: GoodLegalCo is a practice management software that enables law firms to run from anywhere.
What’s up with that? Clearly it knows the answer. To a human, ‘What does the company GoodLegalCo do?’ seems like a clear question. For humans, there are a variety of ways to phrase a question that fundamentally ask the same thing. For an agent to be human-like, it should respond to all of them pretty similarly.
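A toy illustration of why this can happen: retrieval compares the question to stored chunks and applies a similarity cutoff, and extra filler words in one phrasing can dilute the match enough that no chunk clears the cutoff. Real systems use embedding similarity rather than the word overlap used here, and the 0.5 cutoff is invented, but the failure mode is analogous.

```python
# Toy illustration of phrasing sensitivity. The chunk text mirrors the
# GoodLegalCo example; the similarity function and the 0.5 cutoff are
# hypothetical stand-ins for embedding similarity and a retrieval
# threshold.

CHUNK = ("GoodLegalCo is a practice management software "
         "that enables law firms to run from anywhere")

def similarity(question: str, chunk: str) -> float:
    """Fraction of the question's words that appear in the chunk."""
    q = {w.strip("?") for w in question.lower().split()}
    c = set(chunk.lower().split())
    return len(q & c) / len(q)

CUTOFF = 0.5  # hypothetical threshold below which no context is used

for question in ("What does the company GoodLegalCo do?",
                 "What is GoodLegalCo?"):
    sim = similarity(question, CHUNK)
    verdict = "answers from chunk" if sim >= CUTOFF else "cannot answer"
    print(f"{question!r}: similarity={sim:.2f} -> {verdict}")
```

The longer phrasing scores low because words like 'does', 'company', and 'do' never appear in the stored chunk, so the one phrasing sails past the cutoff while the other fails — the same asymmetry the real bot showed.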
Problem Two: No consideration for recency.
This agent is connected to a large collection of news and writing from the past. When you ask it a question, it takes some context from that collection and uses it to form an answer. It looks for that context based on semantic similarity to your question. It has no concept of recency. So we should expect some weird behavior. I actually had the chance to ask the founder of LlamaIndex if he knew of any good ways to work around this. He told me that if he did it would have already been in the docs.
Let’s imagine for a second that our bot also has access to a bunch of information about Microsoft and we ask this question.
Q: Who is the CEO of Microsoft?
A: Bill Gates.
This is entirely possible behavior. It gives an answer that was once true and is stated somewhere in the documents it relies on, but is no longer true. It may also answer and say
A: Satya Nadella
That’s right. But if it answers that way, it’s just because the most similar chunk of text in our corpus happened to say so. The most similar chunk could just as easily have said Bill Gates. No idea of recency has been programmed in.
Here’s another example.
For context, let’s say Hot Legal Startup was acquired by Big PE Firm in 2018. Big PE Firm then sold Hot Legal Startup to Big Co in 2020.
Imagine a user asks, ‘Tell me about Hot Legal Startup.’
Our agent tells us about what Hot Legal Startup does and includes that it was recently acquired.
The user then asks ‘Who was it acquired by?’
It is entirely possible that our agent will answer ‘Hot Legal Startup was acquired by Big PE Firm in 2018.’ This is true. But it’s not the answer we would want.
If we talked to our human Guru about this, we’d expect that they would explain both acquisitions. This introduces another issue.
Problem Three: Aggregating information for answers
In the previous example, our agent will never answer ‘Hot Legal Startup was acquired by Big PE Firm in 2018. Big PE Firm then sold Hot Legal Startup to Big Co in 2020’ unless somewhere in our context these facts are mentioned close together. That could be the case, but it isn’t guaranteed.
To answer a question like this, we may really need our agent to compose an answer from info extracted from multiple pieces of relevant information. LlamaIndex has some means of handling this. But it uses a different way of searching your data, which will degrade performance on questions that don’t require this type of aggregation. So how should you think about that tradeoff?
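The tradeoff can be pictured with a toy contrast between the two search styles: a similarity search returns only the top chunk, so facts spread across the corpus never appear together, while an aggregating scan touches every chunk mentioning the entity — more complete, but far more text to process on every question. The chunks below reuse the hypothetical acquisition facts from the example above.

```python
# Toy contrast between top-1 similarity retrieval and an aggregating
# scan. The two acquisitions live in separate chunks, so top-1
# retrieval can only ever surface one of them.

chunks = [
    "Hot Legal Startup was acquired by Big PE Firm in 2018.",
    "Hot Legal Startup makes contract review software.",
    "Big PE Firm sold Hot Legal Startup to Big Co in 2020.",
]

def words(text: str) -> set[str]:
    """Lowercased words with trailing punctuation stripped."""
    return {w.strip(".,?") for w in text.lower().split()}

def top_1(question: str) -> list[str]:
    """Similarity search: return only the single best-matching chunk."""
    return [max(chunks, key=lambda c: len(words(question) & words(c)))]

def aggregate(entity: str) -> list[str]:
    """Aggregating scan: return every chunk that mentions the entity."""
    return [c for c in chunks if entity.lower() in c.lower()]

print(top_1("Who acquired Hot Legal Startup?"))    # surfaces one acquisition only
print(aggregate("Hot Legal Startup"))              # finds both, but scans everything
```

An answer composed from the aggregated chunks could mention both transactions, but paying that scanning cost on every question — including ones a single chunk answers fine — is exactly the tradeoff in question.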
Problem Four: Handling subjective questions
Out of the box, a bot built with GPT and a tool like LlamaIndex doesn’t really know how to handle subjective questions.
Remember, a good legal tech guru bot should respond similarly to a legal tech guru human. Let’s imagine we ask the human legal tech guru, ‘who is the best legal tech company using AI?’ They might answer something like ‘Well, it’s hard to say who is best. Is there a specific task or type of company you are interested in?’ Maybe they would go on to explain some things you should consider, and ultimately mention a few popular companies.
Our digital guru will not be so graceful. I asked mine this exact question. It responded:
‘it is difficult to answer this question without more information. It is possible that Casetext, Fastcase, Cravath, Swaine.’
Not great. It’s confused. For one, it stopped in the middle of a sentence. We can ignore that for now. The first two companies are legit. Cravath and Swaine are not. Cravath, Swaine & Moore is a law firm that has done some work with AI, but that’s not exactly what we’re looking for (which is actually a separate issue). The point is, there’s significant room for improvement. In an ideal world, it would suggest some companies and information for us to ask more questions about, so that we can come to a conclusion on our own.
Problem Five: Lots more problems
The example above is an instance of the bot not answering subjective questions very well. But it’s also an instance of other problems. It’s on the right track in that Cravath, Swaine & Moore is a company in law that uses AI. But it’s a law firm using AI, which isn’t exactly what I meant as a user. As with many examples of bad responses, you could categorize it in more than one way.
I’ve only shown bad responses here. But in fact, this bot is pretty good most of the time. The fact that we can so easily create a bot that answers a huge range of questions so well represents an enormous leap forward in technology.
It just turns out that there’s an enormous gap between giving pretty good, human-sounding answers and answering all questions on a specific topic correctly all the time. The latter might not even be possible. Either way, there’s definitely far more than four or five problems between where I am now and there.
I’ve decided to focus on the four problems I outlined above for three reasons:
- They pop up again and again
- They’re explainable given how the system works
- I think they can be mitigated
How might we mitigate these problems?
What I plan to spend some time exploring and documenting is how we might mitigate the problems outlined above. Without getting too far into brainstorming solutions, I’ll say that I think it’s possible.
For example, take the issue of recency. If the documents the bot relies on include the date they were published, maybe we could store and use it. Maybe we can pass it to the bot and tell it to weigh that information.
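One sketch of that idea: keep a publish date alongside each retrieved chunk, sort the chunks newest-first, and put the dates directly in the prompt along with an instruction to prefer newer information. The chunks and dates below are invented for illustration, and whether the instruction actually works would need testing against a real model.

```python
from datetime import date

# Sketch of a recency heuristic: store a publish date with each chunk,
# order retrieved chunks newest-first, and surface the dates in the
# prompt so the model can weigh them. Chunks and dates are invented.

retrieved = [
    ("Bill Gates is the CEO of Microsoft.", date(1999, 3, 1)),
    ("Satya Nadella is the CEO of Microsoft.", date(2014, 2, 4)),
]

def order_by_recency(chunks_with_dates):
    """Sort (text, published) pairs newest-first."""
    return sorted(chunks_with_dates, key=lambda pair: pair[1], reverse=True)

def build_prompt(question, chunks_with_dates):
    """Include each chunk's date so the model can consider recency."""
    lines = [f"[{published.isoformat()}] {text}"
             for text, published in order_by_recency(chunks_with_dates)]
    context = "\n".join(lines)
    return (f"Context (newest first, dates in brackets):\n{context}\n\n"
            f"Prefer newer information when sources conflict.\n"
            f"Question: {question}\nAnswer:")

print(build_prompt("Who is the CEO of Microsoft?", retrieved))
```

With this ordering, the Satya Nadella chunk appears first and is explicitly dated later than the Bill Gates chunk, giving the model something concrete to prefer instead of just the most semantically similar text.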
Maybe some of the options built into tools like LlamaIndex already solve some of these issues, but that information isn’t well documented yet. This might be the case for the problem of answering questions that require aggregation. Other problems might just require some playing around. Maybe all it takes to handle subjective questions gracefully is a bit of prompt engineering.
I think it’s worth finding out. Because the answers are not well documented right now. And if we can land on some, then they will be useful in many more instances than my Legal Tech Guru.
Over the next six weeks I’ll be exploring the problems outlined above. I plan to focus on one at a time and see if I can find a strategy for mitigating it. Each week I’ll be sharing what I find. By the end of the project I hope to have two things: one, a Legal Tech Guru that works really well and two, a write up of my findings so that they can be of use to others. I'd like to tackle one of these problems a week, but I will be working on this during my nights and weekends and this is the type of work that is best left unrushed.
If you are interested in similar questions, have found a solution to any of the issues outlined above, or think I’m missing something, please reach out! You can find me on Twitter @matt_ambrogi.