Documents
Documents are the building blocks of knowledge bases. Anything that adds value and enriches an LLM's context can be thought of as a document.
Why chunking is important
If a document is small enough, the whole of it can be used to enrich an LLM's context. However, if a document contains more than a few pages it is usually a good idea, and often necessary, to break it down into smaller chunks.
Imagine someone asks you a question about dandelions. Do you think you would be able to answer faster and more accurately with a whole book about horticulture, or with just 5 excerpts from it specifically related to dandelions?
This is essentially how context enrichment works with LLMs, and it is exactly why big documents need to be chunked. When documents are chunked, only the chunks most relevant to the prompt are retrieved, instead of the whole document.
This is one of the reasons why it's important to chunk your documents; the other is the LLM's limited context window. That is fancy talk for the fact that LLMs can only process a limited number of words (more specifically, tokens) at a time. If an LLM can only handle 100 words at a time, giving it a 200-word document means 100 of those words get cut off and their context is lost.
Parser
Documents come in many different shapes and sizes. A parser is a tool that transforms various document types into a textual format usable by LLMs.
Before we can actually start using a document, we must parse it. It is in this process that we specify which parts of the document will be added to the knowledge base.
A generic parser for any document type looks like the following (note that elements are document specific, e.g. pages in a PDF, paragraphs in DOCX, etc.):
| Parameter | Description |
|---|---|
| start | The number of elements to skip at the start of the document. |
| end | The number of elements to skip from the end of the document. |
| range | If set, start and end are ignored and a range of elements to include is selected instead. The range is inclusive on both ends. |
| filter | A list of regular expressions used to exclude undesirable parts of the document, such as signatures and page numbers. |
Additionally, parsers for specific file types are also available.
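The parameters above can be pictured with a small Python sketch. This is illustrative only; the function name, signature, and exact behaviour are assumptions, not Ragu's actual API. Elements stand in for whatever unit the document type uses (pages, paragraphs, etc.):

```python
import re

def parse(elements, start=0, end=0, range=None, filter=()):
    """Illustrative generic parser: select document elements, then drop
    any element matching one of the filter patterns."""
    if range is not None:
        lo, hi = range
        selected = elements[lo - 1:hi]  # inclusive, 1-based range
    else:
        # Skip `start` elements from the front and `end` from the back.
        selected = elements[start:len(elements) - end]
    patterns = [re.compile(p) for p in filter]
    kept = [e for e in selected if not any(p.search(e) for p in patterns)]
    return "\n".join(kept)
```

For example, skipping a cover page and a back page while filtering out page-number lines would combine `start`, `end`, and `filter` in a single call.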
Chunkers
Ragu offers a variety of chunkers for splitting up larger documents into smaller chunks LLMs can handle.
The quality of the generated chunks is important. Imagine those 5 dandelion excerpts all started and ended in the middle of sentences. It would be difficult to reason about the context from which the chunks were taken.
Ragu chunkers are designed to preserve the most context while still being fast enough to enable easy prototyping.
Sliding window
The most basic chunker. Mainly useful for adding a whole document to a collection as a single chunk, if it is small enough. Not recommended for most documents.
| Parameter | Description |
|---|---|
| size | The number of characters to fit in each chunk. |
| overlap | The number of characters to overlap between chunks. For each chunk, `overlap` characters from the previous chunk are prepended and `overlap` characters from the next chunk are appended. |
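A minimal character-based sketch of this scheme (an illustration of the two parameters, not Ragu's implementation):

```python
def sliding_window(text, size, overlap):
    """Split text into fixed-size character chunks, extending each chunk
    with up to `overlap` characters from the neighbouring chunks."""
    chunks = []
    for start in range(0, len(text), size):
        lo = max(0, start - overlap)           # overlap from the previous chunk
        hi = min(len(text), start + size + overlap)  # overlap from the next chunk
        chunks.append(text[lo:hi])
    return chunks

# With size=4 and overlap=2, "abcdefghij" yields
# ['abcdef', 'cdefghij', 'ghij'].
```

Note how the middle chunk carries context from both neighbours, at the cost of some duplicated text; chunks can start and end mid-word, which is why this chunker is rarely the best choice.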
Snapping window
Similar to sliding window, but aware of sentence boundaries. Very useful for prose and documentation. This is the default chunker used for newly uploaded documents.
| Parameter | Description |
|---|---|
| size | The number of characters to fit in each chunk. |
| overlap | The number of sentences to overlap between chunks. |
| delimiter | The sentence delimiter (sentence stop) to use. Usually you want to keep this set to the full stop (`.`). |
| skip_forward | A list of patterns that make the chunker skip a sentence stop if a pattern trails the delimiter. Useful for abbreviations, or when you don't want to treat particular delimiters as sentence stops. |
| skip_back | A list of patterns that make the chunker skip a sentence stop if a pattern leads the delimiter. Useful for abbreviations, or when you don't want to treat particular delimiters as sentence stops. |
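A rough Python sketch of this behaviour (a simplified illustration under assumed semantics, not Ragu's implementation):

```python
def split_sentences(text, delimiter=".", skip_forward=(), skip_back=()):
    """Split text into sentences on `delimiter`, ignoring stops that are
    part of a skip pattern (e.g. abbreviations like "Dr.")."""
    sentences, buf = [], ""
    for i, ch in enumerate(text):
        buf += ch
        if ch == delimiter:
            after, before = text[i + 1:], buf[:-1]
            if any(after.startswith(p) for p in skip_forward):
                continue  # pattern trails the delimiter: not a real stop
            if any(before.endswith(p) for p in skip_back):
                continue  # pattern leads the delimiter: not a real stop
            sentences.append(buf.strip())
            buf = ""
    if buf.strip():
        sentences.append(buf.strip())
    return sentences

def snapping_window(text, size, overlap, **kwargs):
    """Greedily pack whole sentences into chunks of at most `size`
    characters, carrying `overlap` sentences over from the previous chunk."""
    sentences = split_sentences(text, **kwargs)
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence])) > size:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks snap to sentence boundaries, every chunk starts and ends at a sentence stop, which keeps each excerpt readable on its own.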
Semantic Window
Similar to snapping window, but groups chunks based on the semantics of the text. This chunker is also aware of sentence boundaries. It is important to note that this chunker uses a text embedding model to generate embeddings for each chunk, which consumes embedding tokens.
The chunker first splits the whole text in a similar fashion to the snapping window. The resulting chunks are then embedded using a text embedding model and the similarity between neighbouring chunks is calculated (using a configurable distance function). If two chunks are similar enough, i.e. their similarity exceeds the threshold, they are grouped together.
| Parameter | Description |
|---|---|
| size | The number of sentences to fit in each chunk. |
| threshold | The similarity threshold to use when grouping chunks. This is a number from 0 to 1. The larger the threshold, the more similar the chunks will have to be in order to be grouped into one. |
| distance function | The distance function to use when calculating the distance between chunks. Cosine distance is the default. |
| delimiter | The same as snapping window. Used for the initial chunking, before the similarity is calculated. |
| skip_forward | The same as snapping window. Used for the initial chunking, before the similarity is calculated. |
| skip_back | The same as snapping window. Used for the initial chunking, before the similarity is calculated. |
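The grouping step can be sketched as follows. This is an illustration, not Ragu's implementation: `toy_embed` is a bag-of-words stand-in for a real text embedding model (which is what actually spends tokens), and the threshold is applied to cosine similarity as described above.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real text embedding model: a bag-of-words vector.
    A production chunker would call an embedding model here."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_window(chunks, threshold):
    """Greedily merge neighbouring chunks whose similarity
    is at least `threshold`."""
    if not chunks:
        return []
    grouped = [chunks[0]]
    for chunk in chunks[1:]:
        if cosine_similarity(toy_embed(grouped[-1]), toy_embed(chunk)) >= threshold:
            grouped[-1] = grouped[-1] + " " + chunk  # same topic: merge
        else:
            grouped.append(chunk)  # topic shift: start a new group
    return grouped
```

With a threshold of 0.5, two chunks about cats would be merged into one group, while a following chunk about stock markets would start a new group.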