AI In Production 2023-12-01 - Implementing a Retrieval Augmented Generation (RAG) System
Retrieval Augmented Generation (RAG) is a technique that combines information retrieval with LLM-based generation: the user's query is first used to find relevant information in a database or corpus, and that information is then added to the prompt along with instructions to respond based only on the information provided.
It is primarily used to provide up-to-date and non-public information to an LLM, as well as to reduce hallucinations and improve responses by giving the LLM additional relevant context. For example, most LLMs have never seen your internal employee manual and could not reliably answer a question about its sick leave policy on their own.
In its most basic form the process is:
- The operator selects the documents to include in the system.
- If the documents are not text (such as PDFs, Word docs, etc.) the text is extracted.
- The text is chunked into semantically coherent pieces (e.g., paragraphs).
- Each piece is embedded with a specialized ML model that maps it to a high-dimensional vector.
- Each vector is typically put in a vector store (specialized database) for easy retrieval.
- Incoming queries are embedded and the vector store is queried for document chunks that are "close by" in the embedding space. These are expected to be semantically relevant to the query.
- Some number of the closest chunks are added to the prompt, and the prompt is submitted to the LLM.
- The LLM considers the entire prompt and generates a response.
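To make these steps concrete, here is a minimal end-to-end sketch in Python. It assumes the sentence-transformers library for embeddings, uses a plain in-memory array in place of a real vector store, and picks `all-MiniLM-L6-v2` as just one common embedding model; `call_llm` is a hypothetical placeholder for whatever LLM client you use.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # one of many embedding models

def chunk(text: str) -> list[str]:
    # Naive chunking: split on blank lines, i.e. treat paragraphs as chunks.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

# Ingestion: chunk each document and embed every chunk.
documents = ["...your employee manual text...", "...another internal doc..."]
chunks = [c for doc in documents for c in chunk(doc)]
vectors = model.encode(chunks, normalize_embeddings=True)  # unit-length rows

def retrieve(query: str, k: int = 3) -> list[str]:
    # Embed the query and rank chunks by cosine similarity
    # (a plain dot product, since all vectors are normalized).
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def call_llm(prompt: str) -> str:
    # Stand-in for a real client (OpenAI, Anthropic, a local model, ...).
    raise NotImplementedError

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    prompt = (
        "Answer the question using ONLY the context below. If the answer "
        "is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)
```

In production you would swap the in-memory array for a vector database and batch the embedding step, but the shape of the flow stays the same.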
There are many choices and challenges to be aware of, such as:
- evaluation - understanding the types of questions you'd like to answer and assessing your ability to do so.
- the right documents - obvious but important. How do you know your documents are not out of date? How much processing do they need?
- detecting hallucinations - ensuring that RAG actually reduces hallucinations to a manageable level, for example by providing citations back to the source chunks.
- text extraction - PDFs are notoriously hard to process.
- a chunking strategy - are paragraphs the right unit for your docs? (See the chunking sketch after this list.)
- the embedding model - there are many, and performance can vary by use case.
- the computational cost of embedding - it can become a bottleneck when processing many long documents.
- the retrieval policy - how many chunks should you include? Or should you filter by distance/similarity? (See the retrieval-policy sketch after this list.)
- the vector database - there are many specialized ones, and traditional databases (like Postgres with the pgvector extension) have added vector capabilities. (See the sketch after this list.)
- updating the vector store - infrequently or in real time? What are your requirements?
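On chunking: paragraph splitting (as in the sketch above) is a reasonable default, but fixed-size windows with overlap are a common alternative when paragraphs are long or uneven. A hedged variant follows; the size and overlap values are illustrative starting points, not recommendations.

```python
def chunk_overlapping(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    # Fixed-size character windows with overlap, so sentences that straddle
    # a boundary remain intact in at least one chunk.
    # size=500 / overlap=100 are illustrative, untuned values.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```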
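On the retrieval policy: the sketch above uses a fixed top-k, but you can instead keep only chunks whose similarity clears a floor, capped at a maximum count. This variant extends the earlier sketch (reusing its `model`, `vectors`, and `chunks`); the 0.3 threshold is an arbitrary illustration that needs tuning against your own evaluation set.

```python
def retrieve_thresholded(query: str, min_sim: float = 0.3, max_k: int = 5) -> list[str]:
    # Similarity floor plus a hard cap; min_sim=0.3 is purely illustrative.
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q
    ranked = np.argsort(scores)[::-1]
    return [chunks[i] for i in ranked if scores[i] >= min_sim][:max_k]
```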
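On the vector database: as one example of a traditional database with vector capabilities, here is a hedged sketch of the retrieval step against Postgres with the pgvector extension, via psycopg. The table layout, column names, embedding dimension, and connection string are all assumptions for illustration.

```python
# Assumes a table created roughly like:
#   CREATE EXTENSION IF NOT EXISTS vector;
#   CREATE TABLE chunks (id serial PRIMARY KEY, body text, embedding vector(384));
import psycopg

def retrieve_pg(query_vec: list[float], k: int = 3) -> list[str]:
    # <=> is pgvector's cosine-distance operator (smaller means closer).
    with psycopg.connect("dbname=rag") as conn:  # placeholder connection string
        rows = conn.execute(
            "SELECT body FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (str(query_vec), k),
        ).fetchall()
    return [r[0] for r in rows]
```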
There are other related techniques like advanced prompting and alternatives to pure vector retrieval. More on that in a later article.