Does Accounting for Recency Improve the Performance of LLM-based Q&A?

 

What is recency filtering and why would I use it?

LLM-based Q&A on a set of documents works by taking a user’s question, finding the most relevant chunk of text within the documents, and feeding that context to the LLM to aid in response generation. I’ve been exploring whether it is useful to account not just for similarity but also for recency when searching for the right text to pass to the LLM. This primarily applies when the context data is large and aggregated over a period of time, such as a collection of blog posts or news articles.

My experiments show no evidence that accounting for recency improves the performance of these data-connected chatbots. This post will detail why you might be interested in utilizing recency in the first place, and what I found in my experiments.

One problem I noticed early on when experimenting with building these bots was that they often get questions wrong when the question requires the most up-to-date answer. The example I like to give is to imagine you are trying to build a chatbot that knows everything about the company Microsoft, and so you connect it to thousands of blog posts about the company. You then ask it: “Who is the CEO of Microsoft?” The tooling then looks for a chunk of text in your data that seems very similar to your question and feeds it to GPT along with your question, embedded in a base prompt that goes something like: answer the following question given the context provided.

This works well in many cases. But a system like this may well respond “Bill Gates” to the question posed above. And it would have done its job in doing so: it found a similar chunk of text about exactly what you asked and synthesized an answer. But a human knows it’s the wrong answer. This is because the bot has no concept of recency, whereas a human would know that if you ask who the CEO of Microsoft is, you mean the current one. Or, in an ideal world, you might like to hear “It used to be Bill Gates, but Satya Nadella took over in 2014.”

This led me to believe that some type of filtering for recency in the data provided to the model would be helpful. I use LlamaIndex to connect my data to the LLM. Recently, LlamaIndex released some tools for exactly this purpose. I’ve now run some experiments and, while I was optimistic, have found no evidence to suggest that they really help.

 

How it works

The LlamaIndex module works by first gathering the top k nodes most similar to your query. It then sorts them based on provided metadata; in my case, they are sorted by date. It then picks n (n=1 by default) of those nodes to use for response synthesis. So if we keep n at its default value of 1 and set k=3, the index will find the top 3 similar nodes, and then use the single most recent of them to synthesize a response. Note what n and k represent here; I will come back to them.
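To make the mechanics concrete, here is a minimal sketch of that retrieve-then-filter step in plain Python. This is my own illustration, not LlamaIndex code, and it assumes the candidate nodes arrive as (text, date, similarity) tuples already ranked by similarity:

```python
from datetime import date

def recency_filter(similar_nodes, k=3, n=1):
    """Keep the n most recent of the top k most similar nodes."""
    top_k = similar_nodes[:k]  # top k most similar nodes (already ranked by similarity)
    newest_first = sorted(top_k, key=lambda node: node[1], reverse=True)  # sort by date, newest first
    return newest_first[:n]    # the n most recent nodes go to the LLM for synthesis

# Toy example: the more similar node is outdated, so the recency step prefers the newer one.
nodes = [
    ("Bill Gates is the CEO of Microsoft.", date(1999, 6, 1), 0.93),
    ("Satya Nadella became CEO of Microsoft in 2014.", date(2014, 2, 4), 0.89),
]
print(recency_filter(nodes, k=2, n=1))  # -> the Satya Nadella node
```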

 

Setup for my experiments

I’ve been using my Legal Tech Bot as the basis for all my experiments. It is a bot designed to know everything about the legal tech industry and utilizes data from 1000+ documents. This is the system whose performance I’m trying to improve.

 

Recency Filtering Results

I set up recency filtering using LlamaIndex’s fixed recency node postprocessor. I had a small set of questions which I knew from trial and error were not getting proper responses from my current system.
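For concreteness, the setup looked roughly like the sketch below. Treat it as an approximation: the exact import path and constructor arguments have shifted across LlamaIndex versions (some versions also require a service_context), `index` stands in for my existing vector index over the documents, and the dates are assumed to live under a "date" key in each node’s metadata.

```python
from llama_index.indices.postprocessor import FixedRecencyPostprocessor

# n in the terms above: keep only the single most recent of the retrieved nodes,
# sorted by the "date" field in each node's metadata.
recency_postprocessor = FixedRecencyPostprocessor(top_k=1, date_key="date")

# k in the terms above: retrieve the 3 most similar nodes, then let the
# postprocessor pick the most recent of them for response synthesis.
query_engine = index.as_query_engine(
    similarity_top_k=3,
    node_postprocessors=[recency_postprocessor],
)

response = query_engine.query("How many rounds of funding does LawMatics have?")
print(response)
```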

What I found was that recency filtering helped for some of the questions, but not all. An interesting discovery was that whether or not accounting for recency helped for a given question depended on the value of k. For example, the question “How many rounds of funding does LawMatics have?” got a perfect answer with k=2. At k=3, however, the answer was totally wrong. For another question, all responses were wrong until k=7. I experimented with different questions and couldn’t find any one k value that seemed to consistently perform well. I’m optimistic that we’ll have tools that resemble hyperparameter tuning for these values in the future, but for now it’s guess and check.

These results aren’t shocking. Imagine k=5: our index will get the 5 most similar nodes from our corpus, filter by date, and use the most recent to synthesize an answer. But what if the 5th node is the most recent? Now we’re using the 5th most similar node to generate a response, which is trading similarity for recency. However, sometimes a higher k value will work well. An example would be a set of documents that talk about a given topic in a lot of different places, where the up-to-date answer may be buried under more similar, but outdated, nodes.

So from my quantitative testing, I wasn’t sold that recency filtering helped the system as a whole, largely because of this ambiguity around k. I then ran a programmatic evaluation [link] over two different test sets of 100+ questions. I’ll spare the details except to say that, again, there wasn’t strong evidence that recency filtering improved the results of the chatbot.

 

Increasing n helps

At first I only experimented with the k value: how many nodes we collect and then sort. But I mentioned another, n, which represents how many of those k nodes we use to generate our answer after they’ve been sorted. By default, LlamaIndex sets n=1, so we just use the single most recent node. As I outlined, we’re working with a tradeoff between node recency and similarity to the question. Could increasing n help? This means passing more context to the LLM and letting it figure things out. In general, putting more of the work on the LLM is a strategy I think we can expect to work well.
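In terms of the setup sketched earlier, bumping n is just a change to the postprocessor’s top_k (again, argument names may differ slightly depending on your LlamaIndex version):

```python
from llama_index.indices.postprocessor import FixedRecencyPostprocessor

# n=2: keep the two most recent of the retrieved nodes and pass both to the
# LLM, letting it reconcile newer and older information itself.
recency_postprocessor = FixedRecencyPostprocessor(top_k=2, date_key="date")

query_engine = index.as_query_engine(
    similarity_top_k=4,  # k: candidate nodes gathered before the date sort
    node_postprocessors=[recency_postprocessor],
)
```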

From my experiments, increasing to n=2 helped in almost all of the cases I described above and never led to worse performance. In one case, a question that previously had incorrect responses for all values of k now got the right response from k=1 to k=4. So effectively, a higher n seems to take the pressure off getting the k parameter tuned perfectly, which had been my biggest issue. I did not run a programmatic eval with n=2, but am confident it would improve results to some extent when compared to n=1 recency filtering.

 

Does it make sense to use recency filtering?

If your context is very large and may contain outdated information for questions a user might ask, it’s worth considering recency filtering of this sort. I would suggest experimenting in the same way I have. Some have asked: why not just get rid of the older data? If you can, you should. In my case, I want my bot to know old information such as prior funding rounds, CEOs, etc. Still, there are probably some creative avenues for data cleanup.

I personally will not be incorporating recency filtering into my application just yet. I’ll have to accept the known issue of some temporal questions getting incorrect answers (i.e. Bill Gates). But if I committed to setting up recency filtering for my bot, I would also have to commit to finding the right values of k and n. What I’ve found is that the values that might lead to a correct answer for one question may cause another to get a worse answer than before. For me, the tradeoff between recency and similarity is not clearly beneficial to the system as a whole.

If I felt it was critical to implement, I would first run programmatic and qualitative evaluations on a variety of permutations of n and k. This is time intensive and out of scope for the work I’m doing at this moment. In the near future, I hope we will have tools to help developers find optimal values for parameters like these.

To build with LLMs, and the ecosystem of tools around them, is to sail in uncharted waters. That’s what makes it challenging and that’s what makes it exciting.