In this article, you will learn how to create a Retrieval-Augmented Generation (RAG) application that can work with your PDFs or other data sources. This type of application is useful for handling large amounts of text data, such as books or lecture notes, to help create a chatbot that can answer any query based on the provided data. The best part is that we will be using an open-source model, so there is no need to pay for API access.
RAG has gained immense popularity and is one of the most used AI frameworks for creating customized chatbots. It’s also a powerful tool for building knowledge-driven AI applications.
RAG can be thought of as an AI assistant that is well-versed in user data and human language. When asked a question, it utilizes a library of information to provide a detailed and accurate answer. It is a powerful combination of an information retrieval system and a robust LLM.
Retrieval-Augmented Generation (RAG) significantly improves accuracy by retrieving relevant information from external knowledge sources, enhancing the context and precision of generated responses. By grounding generation in factual data retrieved at query time, RAG also reduces hallucinations, a common issue in large language models. Additionally, RAG enhances in-context learning by retrieving specific, up-to-date information, making it ideal for use cases like Q&A, document summarization, and interactive workflows.
For example, we used the YOLOv9 research paper as the data source and Llama-2-7b-chat as the LLM. We then asked a few questions about the paper, and here are the relevant replies.
response=query_engine.query("What is YOLOv9")
YOLOv9 is a new object detection algorithm that improves upon existing methods in several ways. First, it uses a new architecture that reduces the number of parameters and computations while maintaining the same level of accuracy. Second, it uses a new technique called PGI to accurately retain and extract the information needed to map the data to the target. Finally, it is shown to have significantly improved performance compared to existing methods in all aspects.
response_2=query_engine.query("what is GELAN architecture?")
GELAN architecture is a deep neural network designed for object detection tasks. It consists of multiple prediction branches, each of which is connected to a deep supervision branch. The deep supervision branch is responsible for providing auxiliary information to the main branch, which is used to improve the accuracy of object detection. The GELAN architecture also includes an integration network that combines returned gradients from different prediction heads, allowing the main branch to retain complete information about all target objects. This allows the method to alleviate the broken information problem in deep supervision and improve the overall performance of object detection tasks.
response_3=query_engine.query("What is new with YOLO v9?")
YOLO v9 introduces several new features, including:
1. Conventional convolution: YOLO v9 uses conventional convolution instead of depth-wise convolution, which leads to better parameter utilization.
2. PGI: YOLO v9 uses a new technique called PGI (Progressive Gating and Integration) to accurately retain and extract information needed to map the data to the target.
3. Large models: YOLO v9 shows huge advantages in using large models, requiring only 66% of the parameters while maintaining accuracy as RT DETR-X.
4. Comparison with state-of-the-arts: YOLO v9 compares with other train-from-scratch real-time object detectors, showing significant improvement in all aspects.
Please let me know if you have any further questions or if there's anything else I can help you with.
We even tried the application with some personal data, and here is the result.
response=query_engine.query("Who is Shaoni")
Shaoni Mukherjee is a seasoned Technical Writer and AI Specialist with a deep passion for Generative AI and its transformative potential. With over four years of experience in data science and a strong foundation in AI/ML technologies, she specializes in creating in-depth, technical content that simplifies complex concepts. Currently contributing to DigitalOcean, Shaoni focuses on topics like GPU acceleration, deep learning, and large language models (LLMs), ensuring that developers and businesses alike can harness cutting-edge technology. Her expertise lies in breaking down technical innovations into digestible, actionable insights, making her a trusted voice in the world of AI.
We will use the transformers library from Hugging Face for loading pre-trained models and fine-tuning them for RAG.
We all know that large language models (LLMs) are great at generating responses, but if you ask a question about your company's financial status, they will fail and start giving inaccurate information. This happens because LLMs lack access to our private, up-to-date data. By incorporating retrieval-augmented generation (RAG) into a foundation model, we can supply the LLM with that private, current data. This allows us to ask any financial query of the LLM application, and it will answer based on the accurate information we provide as the data source. When we add retrieval-augmented features to a large language model (LLM), it changes how the model finds answers: instead of relying only on what it already knows, the LLM now has access to more accurate information.
Here’s how it works:
This approach allows the model to improve its responses by incorporating additional information, namely its own data, rather than relying solely on its existing knowledge. RAG (Retrieval-Augmented Generation) also avoids the need to retrain the model on new data. Instead, we simply keep our data store up to date: if new insights or documents become available, we add them to our existing resources. As a result, when a user asks a question, the model can access this updated content without going through the entire training process again. This ensures that the model can always provide the most current and relevant answers based on the latest data.
Implementing this approach reduces the likelihood of the model generating incorrect information. It also enables the model to acknowledge when it doesn’t have an answer, if it can’t find a sufficient response within the data store. However, if the retriever doesn’t provide the foundation model with high-quality information, the model might miss answering a question it could have otherwise addressed.
A user asks a question or provides input for an augmented prompt, which can be a statement, query, or task.
The user’s input is first converted into a machine-readable format using an embedding model. Embeddings represent the meaning of the query in a vector (numeric) form, making it easier to match user preferences with relevant information. This numerical representation is stored in a vector database.
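To make these steps concrete, here is a minimal sketch of the retrieve-then-generate loop. It is not the pipeline we build later (LlamaIndex automates all of this for us), and the chunk texts below are invented purely for illustration.
from sentence_transformers import SentenceTransformer, util

# Load the same kind of embedding model we use later in the tutorial.
embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# 1. Pretend these chunks were extracted from our PDFs and embedded ahead of time.
chunks = [
    "YOLOv9 introduces Programmable Gradient Information (PGI).",
    "GELAN is a lightweight network architecture described in the YOLOv9 paper.",
    "Shaoni Mukherjee is a technical writer specializing in AI.",
]
chunk_vectors = embedder.encode(chunks, convert_to_tensor=True)

# 2. Convert the user's question into a vector in the same embedding space.
question = "What is GELAN?"
query_vector = embedder.encode(question, convert_to_tensor=True)

# 3. Retrieve the most relevant chunk by cosine similarity.
scores = util.cos_sim(query_vector, chunk_vectors)[0]
best_chunk = chunks[int(scores.argmax())]

# 4. Augment the prompt with the retrieved context before sending it to the LLM.
augmented_prompt = f"Context: {best_chunk}\n\nQuestion: {question}\nAnswer:"
print(augmented_prompt)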
We recommend going through the tutorial to set up the GPU Droplet and run the code. We have added a link to the references section that will guide you through creating a GPU Droplet and configuring it using VSCode.
To begin, we will need a PDF, Markdown, or any documentation files. Make sure to create a separate folder to store the PDFs.
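For example, in a notebook you could create the folder and copy a PDF into it like this (the folder name data and the file path are placeholders; adjust them to your setup):
# Create a folder for the source documents and copy a PDF into it.
!mkdir -p data
!cp /path/to/your-document.pdf data/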
Start by installing all the necessary packages. The code below installs everything required as a first step.
!pip install pypdf
!pip install -U bitsandbytes
!pip install langchain
!pip install -U langchain-community
!pip install sentence_transformers
!pip install llama_index
!pip install llama-index-llms-huggingface
!pip install llama-index-llms-huggingface-api
!pip install llama-index-embeddings-langchain
from llama_index.core import VectorStoreIndex,SimpleDirectoryReader,ServiceContext,PromptTemplate
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core.prompts.prompts import SimpleInputPrompt
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
# from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
The following section contains the complete code for building the RAG application. Each step is explained throughout the article as you read along.
import torch
documents=SimpleDirectoryReader("your/pdf/location/data").load_data()
# print(documents)
system_prompt="""
You are a Q&A assistant. Your goal is to answer questions as
accurately as possible based on the instructions and context provided.
"""
## Default format supportable by LLama2
query_wrapper_prompt=SimpleInputPrompt("<|USER|>{query_str}<|ASSISTANT|>")
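# Log in with a Hugging Face access token; the Llama 2 weights are gated and require approved access.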
!huggingface-cli login
llm = HuggingFaceLLM(
context_window=4096,
max_new_tokens=256,
generate_kwargs={"temperature": 0.0, "do_sample": False},
system_prompt=system_prompt,
query_wrapper_prompt=query_wrapper_prompt,
tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
model_name="meta-llama/Llama-2-7b-chat-hf",
device_map="auto",
model_kwargs={"torch_dtype": torch.float16 , "load_in_8bit":True}
)
embed_model = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-mpnet-base-v2"
)
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
Settings.num_output = 512
Settings.context_window = 3900
# a vector store index only needs an embed model
index = VectorStoreIndex.from_documents(
documents, embed_model=embed_model
)
# create a query engine
query_engine = index.as_query_engine(llm=llm)
response=query_engine.query("what is GELAN architecture?")
print(response)
Once we store the data, it needs to be split into chunks. The code below loads the data and splits it into chunks.
# load the data
documents=SimpleDirectoryReader("//your repo path/data").load_data()
# split the data into chunks
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
Each document object contains the text content along with its metadata. Since a document can be very long, we need to split it into smaller chunks as part of the preprocessing step for RAG. These smaller, focused pieces of information help the system find and retrieve the relevant context and details more accurately. By breaking documents into clear sections, it becomes easier to locate domain-specific passages or facts, which improves the RAG application's performance. We could also use RecursiveCharacterTextSplitter from langchain.text_splitter; in our case, we are using SentenceSplitter from llama_index.core.node_parser.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=300,
chunk_overlap=100,
length_function=len,
add_start_index=True,
)
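As a quick sketch of how this splitter could be applied (in this tutorial we rely on SentenceSplitter instead), you might split the raw text of the first loaded document like this:
# Split the raw text of the first loaded document into overlapping chunks.
# "documents" is the list returned earlier by SimpleDirectoryReader.
chunks = text_splitter.split_text(documents[0].text)
print(f"Produced {len(chunks)} chunks; first chunk:\n{chunks[0]}")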
For more information on RecursiveCharacterTextSplitter please visit the link in the reference section.
Now, we will learn about the embeddings!
Embeddings are numerical representations of text data that help capture the data’s underlying meaning. They convert data into vectors, essentially arrays of numbers, making them easier for machine learning models to understand and work with.
In the case of text embeddings (e.g., word or sentence embeddings), vectors are designed so that words or phrases with similar meanings are close to each other in the vector space. For instance, “king” and “queen” would have close vectors, while “king” and “apple” would be far apart. Further, the distance between these vectors can be calculated by cosine similarity or Euclidean distance.
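As a small illustration using the same sentence-transformers model we load below, the exact scores will vary, but "king" and "queen" should come out noticeably more similar than "king" and "apple":
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
king, queen, apple = model.encode(["king", "queen", "apple"])

# Cosine similarity close to 1 means the meanings are close in the vector space.
print("king vs queen:", float(util.cos_sim(king, queen)))
print("king vs apple:", float(util.cos_sim(king, apple)))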
For example, here we will use "sentence-transformers/all-mpnet-base-v2" through the HuggingFaceEmbeddings wrapper.
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
embed_model = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-mpnet-base-v2"
)
This step involves selecting a pre-trained model, in this case 'sentence-transformers/all-mpnet-base-v2', to generate the embeddings; it is chosen for its compact size and strong performance. We can pick any model from the Sentence Transformers library; this one maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search in search engines.
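You can check the dimensionality yourself: the LangChain wrapper exposes an embed_query method that returns the raw vector as a plain list of floats.
# Quick sanity check of the embedding dimensionality.
vector = embed_model.embed_query("What is GELAN architecture?")
print(len(vector))  # 768 for all-mpnet-base-v2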
# a vector store index only needs an embed model
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(
documents, embed_model=embed_model
)
The same embedding model will then be used to create the embeddings for the documents during the index construction process and for any queries for the query engine.
# create a query engine
query_engine = index.as_query_engine(llm=llm)
response=query_engine.query("Who is Shaoni")
print(response)
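If you want to see which chunks were retrieved to ground a given answer, the response object exposes the source nodes along with their similarity scores (the exact attributes may vary slightly between llama_index versions):
# Inspect the retrieved chunks and their similarity scores.
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.get_content()[:150])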
Now, let us talk about our LLM; here we are using the fine-tuned Llama 2 7B chat model for our example. Meta has developed and released the Llama 2 family of large language models (LLMs), which includes a range of pre-trained and fine-tuned generative text models with sizes from 7 billion to 70 billion parameters. These models consistently outperform many open-source chat models and are comparable to popular closed-source models like ChatGPT and PaLM.
system_prompt="""
You are a Q&A assistant. Your goal is to answer questions as
accurately as possible based on the instructions and context provided.
"""
## Default format supportable by LLama2
query_wrapper_prompt=SimpleInputPrompt("<|USER|>{query_str}<|ASSISTANT|>")
Now we can bring together our LLM, embedding model, and documents and ask questions about the data using the lines of code below.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents=SimpleDirectoryReader("//your repo path/data").load_data()
index = VectorStoreIndex.from_documents(
documents, embed_model=embed_model
)
query_engine = index.as_query_engine(llm=llm)
response=query_engine.query("what are the drabacks discussed in yolo v9?")
print(response)
YOLOv9 has several drawbacks discussed in the paper, including:
1. Computational complexity: While YOLOv9 is Pareto optimal in terms of accuracy and computation complexity among all models with different scales, it still has a relatively high computation complexity compared to other state-of-the-art methods.
2. Parameter utilization: YOLOv9 using conventional convolution has lower parameter utilization than YOLO MS using depth-wise convolution, and even worse, large models of YOLOv9 have lower parameter utilization than RT DETR using ImageNet pretrained model.
3. Training time: YOLOv9 requires a longer training time compared to other state-of-the-art methods, which can be a limitation for real-time object detection applications.
Please let me know if you have any further questions or if there's anything else I can help you with.
While this tutorial does not require a high-end GPU, a standard CPU will not handle the computation efficiently: more complex operations, such as generating vector embeddings or running large language models, will be much slower and may lead to performance issues. For optimal performance and faster results, it is recommended to use a capable GPU, especially when working with a large number of documents or a more advanced LLM such as Falcon 180B. Using DigitalOcean's GPU Droplets to build a Retrieval-Augmented Generation (RAG) application offers several benefits.
In conclusion, Retrieval-Augmented Generation (RAG) is an important AI framework that significantly enhances the capabilities of large language models (LLMs) to create AI applications. By effectively combining the strengths of information retrieval with the power of large language models, RAG systems can deliver accurate, contextually relevant, and informative responses. This integration improves the quality of interactions across various domains—such as customer support, content creation, and personalized recommendations—and allows organizations to leverage vast amounts of data efficiently. As the demand for intelligent, responsive applications grows, RAG will stand out as a powerful framework that helps developers build more intelligent systems that better serve users’ needs. Its adaptability and effectiveness make it a key player in the future of AI-driven solutions.