Build a RAG-based QnA application using Llama3 models from SageMaker JumpStart
Organizations generate vast amounts of data that is proprietary to them, and it’s critical to get insights out of the data for better business outcomes. Generative AI and foundation models (FMs) play an important role in creating applications using an organization’s data that improve customer experiences and employee productivity.
The FMs are typically pretrained on a large corpus of data that’s openly available on the internet. They perform well at natural language understanding tasks such as summarization, text generation, and question answering on a broad variety of topics. However, they can sometimes hallucinate or produce inaccurate responses when answering questions that they haven’t been trained on. To prevent incorrect responses and improve response accuracy, a technique called Retrieval Augmented Generation (RAG) is used to provide models with contextual data.
In this post, we provide a step-by-step guide for creating an enterprise-ready RAG application such as a question answering bot. We use the Llama3-8B FM for text generation and the BGE Large EN v1.5 text embedding model for generating embeddings, both from Amazon SageMaker JumpStart. We also showcase how you can use FAISS as an embeddings store and packages such as LangChain for interfacing with the components and running inference within a SageMaker Studio notebook.
SageMaker JumpStart
SageMaker JumpStart is a powerful feature within the Amazon SageMaker ML platform that provides ML practitioners a comprehensive hub of publicly available and proprietary foundation models.
Llama 3 overview
Llama 3 (developed by Meta) comes in two parameter sizes, 8B and 70B, with an 8K context length, and can support a broad range of use cases with improvements in reasoning, code generation, and instruction following. Llama 3 uses a decoder-only transformer architecture and a new tokenizer with a 128K-token vocabulary that improves model performance. In addition, Meta improved post-training procedures, which substantially reduced false refusal rates, improved alignment, and increased diversity in model responses.
BGE Large overview
The embedding model BGE Large stands for BAAI general embedding large. It’s developed by BAAI and is designed to enhance retrieval capabilities within large language models (LLMs). The model supports three retrieval methods:
- Dense retrieval (BGE-M3)
- Lexical retrieval (LLM Embedder)
- Multi-vector retrieval (BGE Embedding Reranker).
You can use the BGE embedding model to retrieve relevant documents and then use the BGE reranker to obtain final results.
On Hugging Face, the Massive Text Embedding Benchmark (MTEB) is provided as a leaderboard for diverse text embedding tasks. It currently provides 129 benchmarking datasets across 8 different tasks on 113 languages. The top text embedding models from the MTEB leaderboard are made available from SageMaker JumpStart, including BGE Large.
For more details about this model, see the official Hugging Face model card page.
RAG overview
Retrieval-Augmented Generation (RAG) is a technique that enables the integration of external knowledge sources with FMs. RAG involves three main steps: retrieval, augmentation, and generation.
First, relevant content is retrieved from an external knowledge base based on the user’s query. Next, this retrieved information is combined or augmented with the user’s original input, creating an augmented prompt. Finally, the FM processes this augmented prompt, which includes both the query and the retrieved contextual information, and generates a response tailored to the specific context, incorporating the relevant knowledge from the external source.
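The three steps can be sketched in a few lines of plain Python. This is a toy illustration only: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector store, and `echo_llm` is a hypothetical placeholder for a call to a model endpoint. None of these names come from the solution's notebook.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Step 1: return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(knowledge_base, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def augment(query: str, context: list[str]) -> str:
    """Step 2: combine the retrieved context with the user's question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"


def generate(prompt: str, llm) -> str:
    """Step 3: send the augmented prompt to the model."""
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Hypothetical placeholder for a real model endpoint call."""
    return f"(model output for a prompt of {len(prompt)} characters)"


knowledge_base = [
    "FAISS stores embeddings and supports similarity search.",
    "Llama 3 is a decoder-only transformer model.",
    "SageMaker JumpStart hosts foundation models as endpoints.",
]
query = "Which library stores embeddings?"
context = retrieve(query, knowledge_base, k=1)
prompt = augment(query, context)
answer = generate(prompt, echo_llm)
print(prompt)
```

In the actual solution, the embedding model and the LLM are SageMaker endpoints and the knowledge base is a FAISS index, but the control flow has the same shape.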
Solution overview
You will construct a RAG QnA system on a SageMaker notebook using the Llama3-8B model and BGE Large embedding model. The following diagram illustrates the step-by-step architecture of this solution, which is described in the following sections.
Implementing this solution takes three high-level steps: deploying models, data processing and vectorization, and running inferences.
To demonstrate this solution, a sample notebook is available in the GitHub repo.
The notebook is powered by an ml.t3.medium instance to demonstrate deploying the model as an API endpoint using an SDK through SageMaker JumpStart. You can use these model endpoints to explore, experiment with, and compare advanced RAG application techniques using LangChain. We also illustrate the integration of the FAISS embeddings store into the RAG workflow, highlighting its role in storing and retrieving embeddings to enhance the application’s performance.
We will also discuss how you can use LangChain to create effective and more efficient RAG applications. LangChain is a Python library designed to build applications with LLMs. It provides a modular and flexible framework for combining LLMs with other components, such as knowledge bases, retrieval systems, and other AI tools, to create powerful and customizable applications.
After everything is set up, when a user interacts with the QnA application, the flow is as follows:
- The user sends a query using the QnA application.
- The application sends the user query to the vector database to find similar documents.
- The documents returned as a context are captured by the QnA application.
- The QnA application submits a request to the SageMaker JumpStart model endpoint with the user query and context returned from the vector database.
- The endpoint sends the request to the SageMaker JumpStart model.
- The LLM processes the request and generates an appropriate response.
- The response is captured by the QnA application and displayed to the user.
Prerequisites
To implement this solution, you need the following:
- An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
- Basic familiarity with SageMaker and AWS services that support LLMs.
- The Jupyter notebook requires an ml.t3.medium instance.
- Access to accelerated instances (GPUs) for hosting the LLMs. This solution needs access to a minimum of the following instance sizes:
  - ml.g5.12xlarge for endpoint use when deploying the BGE Large En v1.5 text embedding model
  - ml.g5.2xlarge for endpoint use when deploying the Llama-3-8B model endpoint
To increase your quota, refer to Requesting a quota increase.
Prompt template for Llama3
While both Llama 2 and Llama 3 are powerful language models that are optimized for dialogue-based tasks, their prompting formats differ significantly in how they handle multi-turn conversations, specify roles, and mark message boundaries, reflecting distinct design choices and trade-offs.
Llama 3 prompting format: Llama 3 employs a structured format designed for multi-turn conversations involving different roles (system, user, and assistant). It uses dedicated tokens to explicitly mark roles, message boundaries, and the end of the prompt:
- Placeholder tokens: {{user_message}} and {{assistant_message}}
- Role marking: <|start_header_id|>{role}<|end_header_id|>
- Message boundaries: <|eot_id|> signals the end of a message within a turn.
- Prompt end marker: <|start_header_id|>assistant<|end_header_id|> signals the start of the assistant’s response.
Llama 2 prompting format: Llama 2 uses a more compact representation with different tokens for handling conversations:
- User message enclosure: [INST][/INST]
- Start and end of sequence: <s></s>
- System message enclosure: <<SYS>><</SYS>>
- Message separation: <s></s> separates user messages and model responses.
Key differences:
- Role specification: Llama 3 uses a more explicit approach with dedicated tokens, while Llama 2 relies on enclosing tags.
- Message boundary marking: Llama 3 uses <|eot_id|>, Llama 2 uses <s></s>.
- Prompt end marker: Llama 3 uses <|start_header_id|>assistant<|end_header_id|>, Llama 2 uses [/INST] and </s>.
The choice depends on the use case and integration requirements. Llama 3’s format is more structured and role-aware and is better suited for conversational AI applications with complex multi-turn conversations. Llama 2’s format, while more compact, might be less explicit in handling roles and message boundaries.
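To make the Llama 3 format concrete, here is a minimal sketch of a helper that turns a list of role-tagged messages into a single prompt string. The function name `format_llama3_prompt` and the message dictionary layout are illustrative assumptions. The special tokens and the blank line after each header follow Meta's published Llama 3 chat format.

```python
def format_llama3_prompt(messages: list[dict]) -> str:
    """Render role-tagged messages in the Llama 3 chat format.

    Each message is a dict with "role" (system, user, or assistant)
    and "content" keys. This layout is an assumption for illustration.
    """
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Role marking, then the message body, then the message boundary.
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Prompt end marker: ask the model to produce the assistant's turn.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Retrieval Augmented Generation?"},
]
prompt = format_llama3_prompt(messages)
print(prompt)
```

For a multi-turn conversation, append earlier assistant replies to `messages` with the role `assistant` before the latest user message.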
Implement the solution
To implement the solution, you’ll use the following steps:
- Set up a SageMaker Studio notebook
- Deploy models on Amazon SageMaker JumpStart
- Set up Llama3-8b and BGE Large En v1.5 models with LangChain
- Prepare data and generate embeddings
- Load documents of different kinds and generate embeddings to create a vector store
- Retrieve documents relevant to the question using the following approaches from LangChain
  - Regular Retrieval Chain
  - Parent Document Retriever Chain
- Prepare a prompt that goes as input to the LLM and presents an answer in a human friendly manner
Set up a SageMaker Studio notebook
To follow the code in this post:
- Open SageMaker Studio and clone the following GitHub repository.
- Open the notebook RAG-recipes/llama3-rag-langchain-smjs.ipynb and choose the PyTorch 2.0.0 Python 3.10 GPU Optimized image, Python 3 kernel, and ml.t3.medium as the instance type.
- If this is your first time using SageMaker Studio notebooks, see Create or Open an Amazon SageMaker Studio Notebook.
To set up the development environment, you need to install the necessary Python libraries, as demonstrated in the following code. The example notebook provided includes these commands:
After the libraries are written in requirement.txt, install all the libraries:
Deploy pretrained models
After you’ve imported the required libraries, you can deploy the Llama 3 8B Instruct LLM on SageMaker JumpStart using the SageMaker SDK:
- Import the JumpStartModel class from the SageMaker JumpStart library.
- Specify the model ID for the HuggingFace Llama 3 8b Instruct LLM model, and deploy the model.
- Specify the model ID for the HuggingFace BGE Large EN embedding model and deploy the model.
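The deployment steps above might look like the following sketch. The two model ID strings are assumptions based on the JumpStart catalog naming, so verify them in the JumpStart model catalog before use. The helper `deploy_jumpstart_model` is a hypothetical wrapper, and the SageMaker SDK import is deferred into the function so that the snippet can be read and loaded without the SDK installed.

```python
from typing import Optional

# Assumed JumpStart model IDs; confirm them in the JumpStart model catalog.
LLM_MODEL_ID = "meta-textgeneration-llama-3-8b-instruct"
EMBEDDING_MODEL_ID = "huggingface-sentencesimilarity-bge-large-en-v1-5"


def deploy_jumpstart_model(model_id: str, accept_eula: Optional[bool] = None):
    """Deploy a JumpStart model to a real-time endpoint and return its predictor.

    Calling this function creates billable resources in your AWS account.
    """
    # Deferred import: requires the SageMaker Python SDK and AWS credentials.
    from sagemaker.jumpstart.model import JumpStartModel

    model = JumpStartModel(model_id=model_id)
    # Gated models such as Llama 3 require explicit EULA acceptance.
    return model.deploy(accept_eula=accept_eula)


# Usage in a SageMaker Studio notebook with sufficient instance quota:
# llm_predictor = deploy_jumpstart_model(LLM_MODEL_ID, accept_eula=True)
# embedding_predictor = deploy_jumpstart_model(EMBEDDING_MODEL_ID)
# print(llm_predictor.endpoint_name, embedding_predictor.endpoint_name)
```

Each deployment can take several minutes. The returned predictors expose an `endpoint_name` attribute, which you need when wiring the endpoints into LangChain.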
Set up models with LangChain
For this step, you’ll use the following code to set up models.
- Replace the endpoint names in the following code snippet with the endpoint names that are deployed in your environment. You can get the endpoint names from the predictors created in the previous section, or view the endpoints in SageMaker Studio by choosing Deployments, then Endpoints, in the left navigation pane. Then replace the values for llm_endpoint_name and embedding_endpoint_name.
- Transform input and output data to process API calls for Llama 3 8B Instruct on Amazon SageMaker.
- Instantiate the LLM with SageMaker and LangChain.
- Transform input and output data to process API calls for BGE Large En on SageMaker.
- Instantiate the embedding model with SageMaker and LangChain.
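The input and output transformations amount to serializing a request to JSON bytes and parsing the JSON response. The following sketch shows that logic as plain functions. The payload field names (`inputs`, `parameters`, `generated_text`, `text_inputs`, `mode`, `embedding`) are assumptions about the endpoints' request and response schemas, so check them against the model documentation. In LangChain, logic like this typically lives in the `transform_input` and `transform_output` methods of content handler classes that are passed to the SageMaker endpoint wrappers.

```python
import io
import json


def llm_transform_input(prompt: str, model_kwargs: dict) -> bytes:
    """Serialize a prompt and generation parameters for the LLM endpoint."""
    payload = {"inputs": prompt, "parameters": model_kwargs}  # assumed schema
    return json.dumps(payload).encode("utf-8")


def llm_transform_output(output) -> str:
    """Extract the generated text from the endpoint's streaming body."""
    response = json.loads(output.read().decode("utf-8"))
    return response["generated_text"].strip()  # assumed response field


def embedding_transform_input(texts: list[str]) -> bytes:
    """Serialize a batch of texts for the embedding endpoint."""
    payload = {"text_inputs": texts, "mode": "embedding"}  # assumed schema
    return json.dumps(payload).encode("utf-8")


def embedding_transform_output(output) -> list[list[float]]:
    """Extract one embedding vector per input text."""
    response = json.loads(output.read().decode("utf-8"))
    return response["embedding"]  # assumed response field


# Simulate an endpoint response with an in-memory byte stream.
request_body = llm_transform_input("Hello", {"max_new_tokens": 64})
fake_response = io.BytesIO(b'{"generated_text": " Hi there. "}')
print(request_body)
print(llm_transform_output(fake_response))
```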
Prepare data and generate embeddings
In this example, you will use several years of Amazon’s Annual Reports (SEC filings) for investors as a text corpus to perform QnA on.
- Start by using the following code to download the PDF documents from the provided URLs and create a list of metadata for each downloaded document.
If you look at the Amazon 10-Ks, the first four pages of each are very similar. Keeping them in the embeddings causes repetition, makes embedding generation take longer, and might skew your results.
- In the next step, you will take the downloaded data, trim the first four pages of each 10-K, and overwrite the files as processed files.
- After downloading, you can load the documents with the help of the PyPDF-based DirectoryLoader available in LangChain and split them into smaller chunks. Note: The retrieved document or text should be large enough to contain enough information to answer a question, but small enough to fit into the LLM prompt. Also, the embedding model has an input limit of 512 tokens, which translates to approximately 2,000 characters. For this use case, you create chunks of approximately 1,000 characters with an overlap of 100 characters using RecursiveCharacterTextSplitter.
- Before you proceed, look at some of the statistics regarding the document preprocessing you just performed:
- You started with four PDF documents, which have been split into approximately 500 smaller chunks. Now you can see what a sample embedding looks like for one of those chunks.
You can create the vector store using the FAISS implementation inside LangChain, which takes input from the embedding model and the documents. Using the index wrapper (VectorStoreIndexWrapper), you can abstract away most of the heavy lifting, such as creating the prompt, getting embeddings of the query, fetching the relevant documents, and calling the LLM.
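The chunk size and overlap settings described earlier can be illustrated with a simplified fixed-width splitter. This is not the RecursiveCharacterTextSplitter algorithm, which also tries to break on paragraph and sentence separators. It only shows how a 1,000-character window with a 100-character overlap walks through a document, and why such chunks stay under the embedding model's input limit. The function name and the four-characters-per-token ratio are assumptions, the latter derived from the 512-token and 2,000-character figures above.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-width chunks whose edges overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early so the last chunk always contains some new text,
    # not just the overlap carried over from the previous chunk.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


# Rough estimate implied by "512 tokens is about 2,000 characters".
APPROX_CHARS_PER_TOKEN = 4
MAX_EMBEDDING_TOKENS = 512

sample_text = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample_text)
for chunk in chunks:
    estimated_tokens = len(chunk) / APPROX_CHARS_PER_TOKEN
    print(len(chunk), "characters, about", int(estimated_tokens), "tokens")
```

With these settings, each chunk is estimated at roughly 250 tokens, which leaves headroom below the 512-token limit even when the real tokenizer produces more tokens than the estimate.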
Answer questions using a LangChain vector store wrapper
You use the wrapper provided by LangChain, which wraps around the vector store and takes input from the LLM. This wrapper performs the following steps behind the scenes:
- Inputs the question
- Creates question embedding
- Fetches relevant documents
- Stuffs the documents and the question into a prompt
- Invokes the model with the prompt and generates the answer in a human readable manner.
Note: In this example, we use Llama 3 8B Instruct as the LLM on Amazon SageMaker. This model performs best when the inputs are provided in the format <|begin_of_text|><|start_header_id|>system<|end_header_id|>{{system_message}}<|eot_id|><|start_header_id|>user<|end_header_id|>{{user_message}}, and the model is asked to generate an output after <|eot_id|><|start_header_id|>assistant<|end_header_id|>.
The following is an example of how to control the prompt so that the LLM stays grounded and doesn’t answer outside the context.
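One way to keep the model grounded is to put the restriction in the system message and pass the retrieved text in the user message. The following is a minimal sketch under that assumption. The template wording, the constant name, and the `build_grounded_prompt` helper are illustrative and are not taken from the notebook.

```python
GROUNDED_PROMPT_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "Answer the question using only the context provided. "
    "If the context does not contain the answer, say that you don't know."
    "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)


def build_grounded_prompt(context_chunks: list[str], question: str) -> str:
    """Fill the template with retrieved chunks and the user's question."""
    context = "\n\n".join(context_chunks)
    return GROUNDED_PROMPT_TEMPLATE.format(context=context, question=question)


grounded_prompt = build_grounded_prompt(
    ["Chunk one of retrieved text.", "Chunk two of retrieved text."],
    "What does the retrieved text say?",
)
print(grounded_prompt)
```

The same template string, with `{context}` and `{question}` as input variables, can be used as the basis for a LangChain prompt template.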
You can ask another question.
Retrieval QA chain
We’ve shown you a basic method to get context-aware answers. Now, let’s look at a more customizable option with RetrievalQA. You can customize how fetched documents are added to the prompt using the chain_type parameter, control the number of relevant documents retrieved by changing the k parameter, and get source documents used by the LLM by enabling return_source_documents. RetrievalQA also allows providing custom prompt templates specific to the model.
You can then ask a question:
Parent document retriever chain
Let’s explore a more advanced RAG option with ParentDocumentRetriever. It balances storing small chunks for accurate embeddings and larger chunks to preserve context. First, a parent_splitter divides documents into larger parent chunks. Then, a child_splitter creates smaller child chunks. Child chunks are indexed in a vector store using embeddings for efficient retrieval. To retrieve relevant info, ParentDocumentRetriever fetches child chunks from the vector store, looks up their parent IDs, and returns corresponding larger parent chunks, stored in an InMemoryStore. This approach balances accurate embeddings with contextual information for meaningful retrieval.
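The parent and child lookup can be sketched without any libraries. In this toy version, paragraphs play the role of parent chunks, sentences play the role of child chunks, and word overlap stands in for embedding similarity. The class and helper names are hypothetical, and LangChain's ParentDocumentRetriever is implemented differently. The sketch only mirrors the flow described above.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set, used as a stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ToyParentDocumentRetriever:
    def __init__(self) -> None:
        self.docstore: dict[int, str] = {}             # parent ID -> parent chunk
        self.child_index: list[tuple[str, int]] = []   # (child chunk, parent ID)

    def add_documents(self, documents: list[str]) -> None:
        for document in documents:
            # Parent splitter: one parent chunk per paragraph.
            for parent in document.split("\n\n"):
                parent_id = len(self.docstore)
                self.docstore[parent_id] = parent
                # Child splitter: one child chunk per sentence.
                for child in parent.split(". "):
                    self.child_index.append((child, parent_id))

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        query_tokens = tokens(query)
        # Rank the small child chunks, then map them back to their parents.
        ranked = sorted(
            self.child_index,
            key=lambda item: len(query_tokens & tokens(item[0])),
            reverse=True,
        )
        parent_ids: list[int] = []
        for _, parent_id in ranked:
            if parent_id not in parent_ids:
                parent_ids.append(parent_id)
            if len(parent_ids) == k:
                break
        return [self.docstore[parent_id] for parent_id in parent_ids]


retriever = ToyParentDocumentRetriever()
retriever.add_documents([
    "The widget team opened a new plant. Widget revenue grew during the year."
    "\n\n"
    "The gadget team hired more staff. Gadget costs stayed flat."
])
results = retriever.retrieve("How did widget revenue change?")
print(results[0])
```

The query matches a single sentence, but the retriever returns the whole paragraph that contains it, so the LLM also sees the surrounding context.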
- Sometimes, the full documents can be so large that you don’t want to retrieve them as is. In that case, you can first split the raw documents into larger chunks, and then split those into smaller chunks. You then index the smaller chunks, but on retrieval you retrieve the larger chunks (but still not the full documents).
- Now, initialize the chain using the ParentDocumentRetriever. Pass the prompt in using the chain_type_kwargs argument.
- Start asking questions:
Clean up
To avoid incurring unnecessary costs, when you’re done, delete the SageMaker endpoints, either using the following code snippets or the SageMaker JumpStart UI.
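A programmatic cleanup might look like the following sketch, which uses the SageMaker client from boto3. The helper name `delete_endpoint_and_resources` is hypothetical. It takes the client as an argument, looks up the endpoint configuration and model names before deleting anything, and then removes all three resource types.

```python
def delete_endpoint_and_resources(sm_client, endpoint_name: str) -> list[str]:
    """Delete an endpoint, its endpoint configuration, and its models.

    sm_client is expected to be a boto3 SageMaker client.
    Returns the names of the models that were deleted.
    """
    # Look up the dependent resources before the endpoint is gone.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if variant.get("ModelName")
    ]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
    return model_names


# Usage (requires boto3 and AWS credentials):
# import boto3
# sm_client = boto3.client("sagemaker")
# for name in (llm_endpoint_name, embedding_endpoint_name):
#     delete_endpoint_and_resources(sm_client, name)
```

If you still have the predictor objects from the deployment step, calling `delete_model()` and `delete_endpoint()` on each predictor is an alternative.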
To use the SageMaker console, complete the following steps:
- On the SageMaker console, under Inference in the navigation pane, choose Endpoints.
- Search for the embedding and text generation endpoints.
- On the endpoint details page, choose Delete.
- Choose Delete again to confirm.
Conclusion
In this post, we showed you a powerful RAG solution using SageMaker JumpStart to deploy the Llama 3 8B Instruct model and the BGE Large En v1.5 embedding model.
We showed you how to create a robust vector store by processing documents of various formats and generating embeddings. This vector store facilitates retrieving relevant documents based on user queries using LangChain’s retrieval algorithms. We demonstrated the ability to prepare custom prompts tailored for the Llama 3 model, ensuring context-aware responses, and presented these context-specific answers in a human-friendly manner.
This solution highlights the power of SageMaker JumpStart in deploying cutting-edge models and the versatility of LangChain in creating effective RAG applications. By seamlessly integrating these components, we enabled high-quality, context-specific response generation, enhancing the Llama 3 model’s performance across natural language processing tasks. To explore this solution and embark on your context-aware language generation journey, visit the notebook in the GitHub repository.
To get started now, check out SageMaker JumpStart in SageMaker Studio.
- SageMaker JumpStart documentation
- SageMaker JumpStart Foundation Models documentation
- SageMaker JumpStart product detail page
- SageMaker JumpStart model catalog
About the Authors
Supriya Puragundla is a Senior Solutions Architect at AWS. She has over 15 years of IT experience in software development, design and architecture. She helps key enterprise customer accounts on their data, generative AI and AI/ML journeys. She is passionate about data-driven AI, and her area of depth is ML and generative AI.
Dr. Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.
Marco Punio is a Sr. Specialist Solutions Architect focused on generative AI strategy, applied AI solutions, and conducting research to help customers hyperscale on AWS. Marco is based in Seattle, WA, and enjoys writing, reading, exercising, and building applications in his free time.
Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.
Yousuf Athar is a Solutions Architect at AWS specializing in generative AI and AI/ML. With a Bachelor’s degree in Information Technology and a concentration in Cloud Computing, he helps customers integrate advanced generative AI capabilities into their systems, driving innovation and competitive edge. Outside of work, Yousuf loves to travel, watch sports, and play football.
Gaurav Parekh is an AWS Solutions Architect specializing in Generative AI, Analytics and Networking technologies.