Creating a knowledge base

You already know how to create an agent; now it's time to create a knowledge base. Before we do that, though, we need to go over a few simple concepts about how the agent interacts with collections.

Reading the LLM primer will definitely help you in the following steps.

As you already know, a collection contains documents. What you might not know is that those documents are chunked beforehand. Yes, even the small example from the introduction was chunked, albeit only one chunk was produced (the document itself) so you might not have noticed it.

Chunking a document is a way of breaking it up into smaller pieces so that they can be fed to an LLM. Pasting a 400-page PDF into a prompt is just not going to work. The prompt will be too long for the LLM to process, hence we need to chunk.

Remember, the quality of an LLM's response is directly related to the quality of its input. For that reason, it's important to:

  1. Have a good selection of documents.

    If your agent is fed mumbo jumbo, you shouldn't be surprised when it spits out mumbo jumbo. Unless you are creating an agent for entertainment purposes (which is always fun), descriptive and clear-cut documents are advised.

  2. Ensure those documents are chunked in a manner that preserves their context.

    Due to the varying nature of documents, this is the tricky bit. Not all chunks have to be perfect, but they should generally be descriptive and retain the semantics of the original document.

The first step is really up to you and your choice of documents. For the second step, Ragu has a user interface where you can play around with document chunks in fast iteration.

For the remainder of this chapter, we'll be learning how to upload and chunk documents. You already know how to assign documents to collections, but you'll also learn what exactly happens during this assignment.

Uploading documents

On the admin page, in the sidebar you will see the Documents section. Clicking on it will take you to a page where you see all documents available, as well as a form to upload new ones. Clicking on the Upload button will open up a small window where you can drag and drop your files or use the built-in file picker to upload them. After selecting and uploading your document, you will be redirected to the document's page where you can configure how it will be parsed and chunked.

Processing documents

On the document's page you will see two sections: Parsing and Chunking.

Parsing a document

A document's parsing configuration determines which parts of the document are included in any collection you add it to.

For example, if you are uploading a PDF, you might want to skip the first or last few pages. Usually PDF documents have a cover page and a table of contents as their starting pages, but that information is not of particular use for agents, so you can usually skip it. You can configure all of this in the parsing configuration.

The parsing parameters are as follows:

  • start - Determines the number of pages to skip at the start of the document. For example, a value of 5 skips the first 5 pages of a document.
  • end - Determines the number of pages to skip from the end of the document. For example, a value of 5 skips the last 5 pages of a document.
  • range - If selected, start and end are no longer treated as numbers of pages to skip. Instead, they mark the first and last page of a range to include. For example, if start is 3 and end is 5, it will include pages 3, 4 and 5.
  • filter - A list of regular expressions used to exclude certain parts of the document.
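
To make the parameters above more concrete, here is a minimal sketch of how they could be applied to a list of page texts. This is not Ragu's actual implementation: the function name select_pages, the use_range flag and the way filter matches are removed from the page text are all assumptions made for illustration.

```python
import re


def select_pages(pages, start=0, end=0, use_range=False, filters=()):
    """Hypothetical helper that mimics the parsing parameters described above.

    `pages` is a list of page texts, where pages[0] is page 1.
    """
    if use_range:
        # Range mode: start and end are 1-based, inclusive page numbers.
        selected = pages[max(start - 1, 0):end]
    else:
        # Skip mode: drop `start` pages from the front and `end` from the back.
        stop = max(len(pages) - end, 0)
        selected = pages[start:stop]

    # Assumption: every match of a filter expression is cut out of the text.
    for pattern in filters:
        selected = [re.sub(pattern, "", page) for page in selected]
    return selected


pages = [f"page {n}" for n in range(1, 11)]
skipped = select_pages(pages, start=2, end=3)                 # pages 3 to 7
ranged = select_pages(pages, start=3, end=5, use_range=True)  # pages 3, 4 and 5
```

Note how the same start and end values mean different things depending on whether range is selected, which is why it's worth double checking them in the preview.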

Chunking a document

Once you've decided which parts of the document you want to include in your collections, you can now configure how those parts will be chunked. This process involves a lot of trial and error. Due to the varying nature of documents, there is no magic configuration that will fit all of them, so you're going to have to play around a bit until you get the chunks you want.

The chunkers available are described in detail in the chunkers chapter, so we're only going to provide a quick overview here.

  • Sliding window - The most straightforward way of chunking a document, but it produces the lowest-quality chunks. You select a base size, i.e. how many characters will be in each chunk, and an overlap that each chunk shares with the previous and subsequent one. This chunker is useful when whole documents can fit into one chunk and should rarely be used otherwise.

  • Snapping window - Works on the same principle as sliding window, except it's aware of sentence stops. This chunker, along with the semantic window, produces the best results for textual documents because it's aware of sentence boundaries and will not produce chunks that start or end in the middle of sentences.

  • Semantic window - Similar to snapping window, but groups chunks based on the semantics of the text. In other words, more similar chunks will be grouped together. It's worth noting that this chunker is embedding based, meaning that if you use a third party embedding service (such as OpenAI), it will spend tokens during previews and actual chunking.
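
To get a feel for the difference between the first two chunkers, here is a rough sketch of both ideas. It is a simplified illustration, not Ragu's actual chunkers: the function names, the parameters and the naive sentence splitting are assumptions, and the overlap is only applied towards the previous chunk. The semantic window is left out because it needs an embedding model.

```python
import re


def sliding_window(text, size, overlap):
    """Fixed-size character chunks; each chunk repeats the last
    `overlap` characters of the chunk before it."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + size])
        if i + size >= len(text):
            break  # the end of the text is already covered
    return chunks


def snapping_window(text, size):
    """Pack whole sentences into chunks of roughly `size` characters,
    so no chunk starts or ends in the middle of a sentence."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "One. Two two. Three three three."
by_characters = sliding_window(text, size=14, overlap=4)
by_sentences = snapping_window(text, size=14)
```

With the same size, the sliding window happily cuts through words and sentences, while the snapping window keeps each sentence whole, even if that means a chunk ends up longer than the requested size.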

Play around with different types of chunking configurations until you find one that suits the document in question. Once you're satisfied with the results, click on the Save button in the respective configuration sections to save the configurations.

Both the parsing and chunking configurations are applied whenever you add the document to any collection, so you only have to configure them once.

Assigning documents to collections

If you've followed the Creating your first agent section of the introduction, you've already done this step.

Once you're satisfied with a document's resulting chunks and have saved its configuration, you can assign that document to any collection you want. Whenever you open a collection's page, you will see a list of all documents assigned to it, as well as a menu where you can add new documents to it.

When you add a document to a collection, what you're really adding to the collection are its chunks. These chunks are then retrieved when users converse with the agent and are used to enrich the agent's context. That's why it's important to have good chunks. If the chunks do not retain the clarity of the original information, the agent will simply not have the necessary context to answer questions in a useful manner.
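
The following toy sketch shows the general idea of retrieving chunks and placing them in the agent's context. It is only an illustration under loud assumptions: the function names and the prompt layout are made up, and the word-overlap scoring is a simple stand-in for whatever similarity search is actually used.

```python
def retrieve(question, chunks, top_k=2):
    """Return the `top_k` chunks that share the most words with the question.

    Word overlap is a toy stand-in for a real similarity search.
    """
    question_words = set(question.lower().split())

    def score(chunk):
        return len(question_words & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)[:top_k]


def build_prompt(question, chunks, top_k=2):
    """Place the retrieved chunks in front of the question (hypothetical layout)."""
    context = "\n".join(f"- {chunk}" for chunk in retrieve(question, chunks, top_k))
    return f"Context:\n{context}\n\nQuestion: {question}"


chunks = [
    "refunds are issued within 14 days",
    "shipping takes 3 days",
    "our office is open on weekdays",
]
prompt = build_prompt("how long do refunds take", chunks, top_k=1)
```

If a chunk had been cut in the middle of the refund sentence, the agent would only see half of the answer, no matter how good the retrieval is.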

There are a few important things to remember when assigning documents to collections:

  • Once a document has been added to a collection, any changes to its configuration will not influence the existing chunks. If you want to update existing chunks, you will need to remove the document from the collection and re-add it.

  • If you delete a document from Ragu, its chunks will be removed from all collections.
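
The two rules above can be pictured with a small in-memory model. Everything in it (the Store class, its methods and the fixed-size chunking) is hypothetical and only mirrors the behaviour described in this section: chunks are produced at the moment a document is added to a collection and stay as they are until the document is removed and re-added.

```python
class Store:
    """Toy model of documents and collections; not Ragu's actual data model."""

    def __init__(self):
        self.documents = {}    # document name -> (text, chunk size)
        self.collections = {}  # collection name -> {document name: [chunks]}

    def _chunk(self, name):
        text, size = self.documents[name]
        return [text[i:i + size] for i in range(0, len(text), size)]

    def add_to_collection(self, collection, name):
        # Chunks are computed once, at assignment time, and stored as-is.
        self.collections.setdefault(collection, {})[name] = self._chunk(name)

    def remove_from_collection(self, collection, name):
        self.collections[collection].pop(name, None)

    def delete_document(self, name):
        # Deleting a document removes its chunks from every collection.
        del self.documents[name]
        for docs in self.collections.values():
            docs.pop(name, None)


store = Store()
store.documents["guide"] = ("abcdefgh", 4)
store.add_to_collection("support", "guide")

# Changing the configuration does not touch the chunks already stored.
store.documents["guide"] = ("abcdefgh", 2)
stale = list(store.collections["support"]["guide"])

# Removing and re-adding the document picks up the new configuration.
store.remove_from_collection("support", "guide")
store.add_to_collection("support", "guide")
fresh = list(store.collections["support"]["guide"])
```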

Next, you'll learn what an agent is and which of its settings you can adjust.