Chunk Size Matters

Intro to Chunk Size

Does chunk size matter for the performance of RAG (Retrieval Augmented Generation) systems? It seems to.

Let’s say we’re building a RAG system around a collection of documents that we want our LLM to have access to. The retrieval step of our system looks for the appropriate information to pass to the model, which is then used as context for generating a response to a query. But this search doesn’t look over the raw documents in full. Instead, we first chunk the documents into smaller pieces, which makes the search for relevant passages more effective. Chunk size, typically measured in tokens (128, 256, etc.), refers to how big those chunks of text are.
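To make that concrete, here’s a minimal sketch of what token-based chunking can look like in LlamaIndex. Exact import paths vary by LlamaIndex version, and the directory name and parameter values here are just placeholders, not the setup used in this post:

```python
# Minimal chunking sketch, assuming LlamaIndex; "blog_posts" and the parameter
# values are placeholders, not the exact configuration used in this post.
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Load the raw documents from disk.
documents = SimpleDirectoryReader("blog_posts").load_data()

# Split each document into chunks of roughly 256 tokens, with a small overlap
# so sentences that straddle a chunk boundary are not lost.
splitter = SentenceSplitter(chunk_size=256, chunk_overlap=20)
nodes = splitter.get_nodes_from_documents(documents)

print(f"{len(documents)} documents -> {len(nodes)} chunks")
```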

Usually, if a RAG system is not performing well, it is because the retrieval step is having a hard time finding the right context to use for generation. Because of that, we want to explore all the levers we have for improving retrieval performance. It turns out that having the right, or wrong, chunk size can make a big difference.

 

Can we find an optimal chunk size?

I wanted to see whether I could experiment and find a specific chunk size that tended to work better for my use case. It turns out I could. Here’s how.

I have a good test RAG system to experiment with. It’s a chatbot set up to answer questions using thousands of blog posts about the legal tech industry. I built this using LlamaIndex, which is, at the moment, probably the most advanced framework for RAG. LlamaIndex just released a tool called Ensemble Retriever, which lets the developer set up a number of slightly different search systems. At query time, it tries each one in parallel and ranks their relative performance.

I used this to do a sort of hyperparameter tuning on chunk size. I set my ensemble retriever up with four different search systems, each with exactly the same configuration except for chunk size. Each was assigned one of the following chunk sizes: 128, 256, 512, or 1024.
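In rough outline, the setup looks something like the sketch below. This isn’t the exact Ensemble Retriever code from LlamaIndex, just an illustration of indexing the same corpus once per chunk size; the names and parameter values are mine, and the actual Ensemble Retriever additionally wires these retrievers together and re-ranks their results.

```python
# Sketch of the per-chunk-size setup: index the same corpus once per chunk size
# and expose each index as a retriever. (Illustrative only; the LlamaIndex
# Ensemble Retriever combines and re-ranks these, which is omitted here.)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

CHUNK_SIZES = [128, 256, 512, 1024]

documents = SimpleDirectoryReader("blog_posts").load_data()

retrievers = {}
for chunk_size in CHUNK_SIZES:
    # Re-chunk the same documents at this size.
    splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=20)
    nodes = splitter.get_nodes_from_documents(documents)

    # Build a separate vector index for this chunk size.
    index = VectorStoreIndex(nodes)
    retrievers[chunk_size] = index.as_retriever(similarity_top_k=3)
```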

I then looped through a set of 75 test questions and kept track of which chunk size performed best. In my case I did this by calculating the Mean Reciprocal Rank (MRR) for each chunk size on each query. MRR is a way of evaluating the effectiveness of a retrieval system, where 1.0 is the best score, meaning the first result retrieved is the correct one, and 0.0 is the worst, meaning none of the retrieved results are correct.
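For concreteness, here’s roughly how such an evaluation loop can be structured. The `expected_source` helper is hypothetical; it stands in for however you label which document should answer each question.

```python
# Sketch of the evaluation loop. `questions` is a list of test questions and
# `expected_source(question)` is a hypothetical helper that returns the file name
# of the document known to contain the answer; substitute your own ground truth.
def reciprocal_rank(retrieved_sources, correct_source):
    """Return 1/rank of the first correct result, or 0.0 if it never appears."""
    for rank, source in enumerate(retrieved_sources, start=1):
        if source == correct_source:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(retriever, questions, expected_source):
    scores = []
    for question in questions:
        results = retriever.retrieve(question)
        # SimpleDirectoryReader stores the originating file name in node metadata.
        sources = [r.node.metadata.get("file_name") for r in results]
        scores.append(reciprocal_rank(sources, expected_source(question)))
    return sum(scores) / len(scores)

# Compare chunk sizes using the retrievers built above:
# for size, retriever in retrievers.items():
#     print(size, mean_reciprocal_rank(retriever, questions, expected_source))
```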

Over 75* questions, here’s how many times each chunk size scored a 1.0:

128: 48

256: 4

512: 4

1024: 2

 

As we can see, 128 far outperformed every other chunk size.

Here’s the mean score for each chunk size:

128: 0.84

256: 0.28

512: 0.24

1024: 0.29

 

Again the picture is clear: 128 far outperformed the other options.

 

Takeaways

So what are the takeaways? I think the key thing to understand is that, in general, a smaller chunk size helps your retrieval system find the relevant piece of context. The tradeoff, which this experiment does not capture, is that the generation step may suffer from a lack of context: if we pass too little information to the LLM, it may struggle to generate an answer that makes sense for your use case.

RAG systems are rarely engineered with a fixed chunk size, but rather a max chunk size, with the specific chunks sized dynamically to fit the text. I don’t think these results mean that one should always set the max chunk size to 128. Rather, they suggest that through testing like this, one can see what magnitude of chunk size tends to work best for their system. One might choose to use that as the max, or might choose to use something like an Ensemble Retriever, which can draw on a variety of chunk sizes and return the best result.

I would also be wary of relying on MRR alone. It would be useful to test these chunk sizes against a more robust, system-level evaluation that looks at the query, retrieved context, and generated response together.

The key takeaway is that chunk size makes a difference, smaller is often better, and you can experiment to find out what works best for your use case.

 


Notes:

* These numbers do not add up to 75 because some questions errored out during the evaluation loop.