A primer on large language models
This section is a bit technical. If you are not interested or you already know how an LLM works, feel free to skip to the introduction.
Large language models are a technology that has recently taken the world by storm. They are powerful, but it is important to remember they are not magic.
LLMs are as good as the data they are trained on and the data they are given to work with. This is important to keep in mind when you create knowledge bases and ultimately when you converse with LLMs.
Training LLMs is lengthy and expensive in terms of hardware, electricity, and labor. These models are trained on vast amounts of data and have an enormous number of parameters, which requires specialized hardware to train efficiently.
Context enrichment
Once an LLM is trained, it will reflect the data it has been trained on. In other words, it will only regurgitate what it already knows. This raises the question: how do we teach it concepts from our domain? How do we provide it with additional data to make it better suit our needs?
One technique, called fine-tuning, can be used to adjust the parameters of an LLM using custom data. This technique readjusts the model's parameters and makes it more likely to respond with the information we have trained it with. This is a very powerful technique, however it has its caveats.
Fine-tuning is an intensive process which, like training, requires a significant amount of data and computing power. It is not as demanding as training, but it is still expensive and requires specialized hardware (if the goal is to do it in a reasonable amount of time). Additionally, once the model has been fine-tuned, adding new information to it requires a whole new round of fine-tuning.
While fine-tuning is an extremely valuable technique for smaller models, it is not the primary way of enriching LLM contexts that Ragu uses. Instead, Ragu uses retrieval augmented generation (RAG).
Don't be scared by the big words, the technique itself is much simpler than it sounds.
The gist of it is the following (there's a toy code sketch after the list):
- User prompts the LLM with a question.
- A text embedding model is used to embed the prompt.
- The prompt embeddings are used to perform semantic similarity search in a vector database.
- An arbitrary amount of data is retrieved from the vector database (retrieval).
- The data is prepended to the prompt to form a context-enriched prompt (augmented).
- The context-enriched prompt is then sent to the LLM, which generates a final response for the user (generation).
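Here is that sketch: a toy, self-contained version of the pipeline. The bag-of-words "embedding", the in-memory "vector database", and the stubbed LLM call are all made-up stand-ins to show the shape of the flow; they are not Ragu's actual components or API.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words vector. Real systems use a neural embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Similarity between two toy vectors (1.0 = identical direction, 0.0 = unrelated)."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy in-memory "vector database": (vector, original text) pairs.
documents = [
    "Ragu uses retrieval augmented generation to enrich prompts.",
    "Fine-tuning adjusts a model's parameters using custom data.",
]
vector_db = [(embed(doc), doc) for doc in documents]

def answer(prompt: str, top_k: int = 1) -> str:
    prompt_vector = embed(prompt)                                  # embed the prompt
    ranked = sorted(vector_db, key=lambda entry: cosine(prompt_vector, entry[0]), reverse=True)
    context = "\n".join(text for _, text in ranked[:top_k])        # retrieval
    enriched = f"Context:\n{context}\n\nQuestion: {prompt}"        # augmentation
    return f"[an LLM would generate the answer from]\n{enriched}"  # generation (stubbed)

print(answer("What technique does Ragu use?"))
```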
The beauty of RAG is that we get to fully utilise the power of an already trained LLM without having to go tinker with its underlying parameters. Instead, at inference (prompting) time we grab the most relevant information from the knowledge base and feed it to the LLM.
Text embeddings (vectors) and vector databases
A text embedding model converts text into a vector representation; embedding a piece of text means creating a vector that represents it. These vectors represent the text in a way that allows us to calculate how "similar" two pieces of the original text are.
More specifically, "similarity" is the distance between two such vectors. Imagine two vectors on a coordinate plane, each representing a piece of text. If the vectors are close together, i.e. pointing in a similar direction, the texts are considered semantically "similar". If they point in opposite directions, the texts are considered semantically "different".
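As a tiny numeric illustration of this (with made-up 2-D vectors rather than real embeddings):

```python
import math

def cosine_similarity(a, b):
    """1.0 means pointing the same way, -1.0 means pointing opposite ways."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

similar  = cosine_similarity((1.0, 2.0), (1.1, 1.9))    # pointing in a similar direction
opposite = cosine_similarity((1.0, 2.0), (-1.0, -2.0))  # pointing in opposite directions

print(round(similar, 3))   # ~0.998 -> semantically "similar"
print(round(opposite, 3))  # -1.0   -> semantically "different"
```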
The same text embedding model that embeds the prompt is also used to embed documents you store in collections. The prompt is embedded (transformed into its vector representation based on the embedding model) and then used as a reference vector to calculate the distance between it and other vectors in the vector database. The closest vectors' text contents are then retrieved and added to the prompt.
A vector database stores these vector representations in a way that allows efficient retrieval. Unlike a traditional database, however, it does not search for exact matches; it searches for the most similar matches. The similarity is defined by the embedding model and is represented by the distance between the search vector (the prompt) and the other vectors stored in the database (the documents).
To further clarify the concept of semantic similarity, think of the words "cat", "car", and "dog". A traditional database would group "cat" and "car" together, since they are lexicographically similar. A vector database, on the other hand, would group "cat" and "dog" together, since they are semantically similar.
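A small sketch of that contrast, using invented 2-D "embeddings" (real embedding models produce vectors with hundreds or thousands of dimensions, but the idea is the same):

```python
import math

# Made-up 2-D "embeddings" chosen to mimic how a real model places
# animals near each other and vehicles elsewhere in the space.
embeddings = {
    "cat": (0.90, 0.10),
    "dog": (0.85, 0.20),
    "car": (0.10, 0.95),
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Lexicographic ordering puts "car" right next to "cat"...
print(sorted(embeddings))  # ['car', 'cat', 'dog']

# ...but by embedding similarity, "dog" is the closest word to "cat".
for word in ("dog", "car"):
    print(word, round(cosine_similarity(embeddings["cat"], embeddings[word]), 3))
# dog ~0.993, car ~0.214 -> "dog" wins
```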
Conversing with LLMs
LLMs have a specific way of accepting messages. Most LLMs will use the following format and message types:
- system - A message that is sent to the LLM at the beginning of the conversation. This is typically used to give the LLM instructions on how to behave. A user never sends this manually; it is constructed automatically depending on the context you set up for the agent.
- user - A message that is sent to the LLM by the user as part of a conversation.
- assistant - An LLM generated message that is sent back to the user.
A system message is typically sent to the LLM at the beginning of a conversation, while user and assistant messages are sent by the user and the LLM, respectively, and alternate (meaning a user message is always followed by an assistant message, never the reverse).
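As a concrete illustration, many chat APIs represent such a conversation as a list of role-tagged messages along these lines (field names vary by provider; this is the common shape, not necessarily Ragu's exact wire format):

```python
# A typical chat conversation: one system message up front, then
# alternating user and assistant messages.
conversation = [
    {"role": "system",    "content": "You are a helpful assistant. Answer using the provided context."},
    {"role": "user",      "content": "What technique does Ragu use to enrich context?"},
    {"role": "assistant", "content": "Ragu uses retrieval augmented generation (RAG)."},
    {"role": "user",      "content": "How does that differ from fine-tuning?"},
]
```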
Keep in mind, this whole primer is an oversimplification for the sake of brevity. Whole papers have been written on these subjects and it's not feasible to explain these concepts in a single page. Nevertheless, these are the essential concepts that should help you when you interact with Ragu.