Retrieval-augmented generation (RAG) is an AI framework that optimizes the output of large language models (LLMs).
RAG combines retrieved information with natural language generation to create responses.
RAG consists of two main components: the retriever, the core of RAG, and the generator, which functions as a chatbot.
The RAG process works as follows:
The retriever encodes user-provided prompts and relevant documents into vectors, stores them in a vector database, and retrieves relevant context vectors based on the distance between the encoded prompt and documents.
The generator then combines the retrieved context with the original prompt to produce a response.
The Dense Passage Retrieval (or DPR) Context Encoder and its tokenizer focus on encoding potential answer passages or documents. This encoder creates embeddings from extensive texts, allowing the system to compare these with question embeddings to find the best match.
Facebook AI Similarity Search, also known as Faiss, is a library developed by Facebook AI Research that offers efficient algorithms for searching through large collections of high-dimensional vectors.
Faiss is essentially a tool to calculate the distance between the question embedding and the vector database of context vector embeddings.
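Faiss itself is a compiled library, but the distance search it performs can be sketched in plain Python (a toy stand-in for illustration, not the Faiss API):

```python
import math

def l2_distance(a, b):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, index):
    # Return the position of the stored vector closest to the query,
    # mimicking what a flat (exact) Faiss index does at search time.
    return min(range(len(index)), key=lambda i: l2_distance(query, index[i]))

# Toy "vector database" of three context embeddings
contexts = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
print(nearest([1.0, 0.0], contexts))  # → 1
```

A real Faiss index uses optimized data structures so this search stays fast even over millions of high-dimensional vectors.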
The DPR question encoder and its tokenizer focus on encoding the input questions into fixed-dimensional vector representations, grasping their meaning and context to facilitate answering them.
A chatbot generates responses to user questions; however, without retrieval it is challenging to generate responses for specific domains, such as a company's mobile phone policy.
The RAG process involves encoding prompts into vectors, storing them, and retrieving the relevant ones to produce a response.
The DPR Context Encoder and its tokenizer encode the potential answer passages or documents.
FAISS is a library developed by Facebook AI Research for searching through large collections of high-dimensional vectors.
In-context learning is a technique that incorporates task-specific examples into the prompt to boost model performance.
Prompt engineering enhances the effectiveness and accuracy of LLMs by designing prompts that ensure relevant responses without continual fine-tuning.
Advanced prompt engineering methods like zero-shot, few-shot, Chain-of-Thought prompting, and self-consistency enhance LLM interactions.
Tools like LangChain and agents can facilitate effective prompt creation and enable complex, multi-domain tasks.
LangChain is an open-source interface that simplifies the application development process using LLMs.
The Document object in LangChain serves as a container for data, with two key attributes: page_content and metadata.
The LangChain document loader handles various document types, such as HTML, PDF, and code from various locations.
Chains in LangChain enable sequential processing, where the output from one step becomes the input for the next, streamlining the prompt generation and processing workflow.
Agents in LangChain dynamically sequence actions, integrating with external tools like search engines and databases to fulfill complex user requests.
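The dynamic tool selection that agents perform can be illustrated with a toy sketch in which a simple routing rule stands in for the LLM's decision step (the tool names and routing rule are invented for illustration):

```python
# Minimal agent sketch: a routing function stands in for the LLM's
# tool-selection step; the tool names here are invented for illustration.
TOOLS = {
    "calculator": lambda q: str(eval(q, {"__builtins__": {}})),
    "search": lambda q: f"Top result for '{q}'",
}

def choose_tool(query):
    # A real agent asks the LLM which tool to use; here a keyword rule decides.
    return "calculator" if any(c.isdigit() for c in query) else "search"

def run_agent(query):
    tool = choose_tool(query)
    return TOOLS[tool](query)

print(run_agent("2 + 3"))          # → 5
print(run_agent("LangChain docs"))
```

In a real LangChain agent, the model itself reasons about which tool to call and can chain several calls before producing a final answer.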
LangChain provides an environment for building large language model (LLM) applications and integrating them with external data sets and workflows.
LangChain simplifies the integration of language models such as GPT-4, making it accessible for developers to build natural language processing (NLP) applications.
The components of LangChain are:
Chains, agents, and retrievers
LangChain-Core
LangChain-Community
Generative models learn the underlying patterns and distribution of a data set in order to generate data that resembles it. Generative models are applicable in generating images, text, and music, augmenting data, discovering drugs, and detecting anomalies.
Types of generative models are:
Gaussian mixture models (GMMs)
Hidden Markov models (HMMs)
Restricted Boltzmann machines (RBMs)
Variational autoencoders (VAEs)
Generative adversarial networks (GANs)
Diffusion models
In-context learning is a method of prompt engineering where task demonstrations are provided to the model as part of the prompt.
Prompts are inputs given to an LLM to guide it toward performing a specific task. They consist of instructions and context.
Prompt engineering is a process where you design and refine the prompts to get relevant and accurate responses from AI.
Prompt engineering has several advantages:
It boosts the effectiveness and accuracy of LLMs.
It ensures relevant responses.
It facilitates meeting user expectations.
It eliminates the need for continual fine-tuning.
A prompt consists of four key elements: instructions, context, input data, and output indicator.
Advanced methods for prompt engineering are: zero-shot prompting, few-shot prompting, chain-of-thought prompting, and self-consistency.
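The difference between zero-shot and few-shot prompting comes down to how the prompt string is assembled. A minimal sketch of few-shot prompt construction (the example pairs are invented):

```python
def few_shot_prompt(examples, query):
    # Build a few-shot prompt: each demonstration is an input/output pair
    # placed before the new query, so the model infers the task in context.
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

examples = [("I loved it!", "positive"), ("Terrible service.", "negative")]
prompt = few_shot_prompt(examples, "The food was great.")
print(prompt)
```

A zero-shot prompt would simply omit the examples list, leaving only the task instruction and the query.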
Prompt engineering tools facilitate interactions with LLMs.
LangChain uses “prompt templates,” which are predefined recipes for generating effective prompts for LLMs.
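The idea behind a prompt template can be sketched without LangChain itself, using Python string formatting (SimplePromptTemplate is a hypothetical stand-in, not the real PromptTemplate class):

```python
class SimplePromptTemplate:
    # A stand-in for LangChain's PromptTemplate: a string with named
    # placeholders plus a format() method that fills them in.
    def __init__(self, template):
        self.template = template

    def format(self, **kwargs):
        return self.template.format(**kwargs)

template = SimplePromptTemplate(
    "Translate the following text to {language}:\n{text}"
)
print(template.format(language="French", text="Hello, world"))
```

The real PromptTemplate adds input validation and composes with other LangChain components, but the fill-in-the-placeholders mechanic is the same.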
An agent is a key component in LLM applications that can perform complex tasks across various domains using different prompts.
The language models in LangChain use text input to generate text output.
The chat model understands the questions or prompts and responds like a human.
The chat model handles various chat messages, such as:
HumanMessage
AIMessage
SystemMessage
FunctionMessage
ToolMessage
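A minimal sketch of how such role-tagged messages might be represented and serialized into a chat prompt (the Message class here is a simplified stand-in for LangChain's message types):

```python
from dataclasses import dataclass

# Simplified stand-in for LangChain's message classes; the real ones
# live in langchain_core.messages and carry additional fields.
@dataclass
class Message:
    role: str
    content: str

def format_chat(messages):
    # Serialize a message list the way a chat model's prompt might look.
    return "\n".join(f"{m.role}: {m.content}" for m in messages)

history = [
    Message("system", "You are a helpful assistant."),
    Message("human", "What is RAG?"),
    Message("ai", "Retrieval-augmented generation."),
]
print(format_chat(history))
```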
The prompt templates in LangChain translate the questions or messages into clear instructions.
An example selector chooses which examples to insert into the prompt, guiding the LLM to generate the desired output.
Output parsers transform the output from an LLM into a suitable format.
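For instance, a comma-separated list parser can be sketched in a few lines of plain Python (a stand-in for LangChain's CommaSeparatedListOutputParser):

```python
def parse_comma_separated(text):
    # Turn a raw LLM string like "red, green, blue" into a Python list,
    # the job a comma-separated list output parser performs.
    return [item.strip() for item in text.split(",") if item.strip()]

print(parse_comma_separated("red, green, blue"))  # → ['red', 'green', 'blue']
```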
LangChain provides comprehensive tools for retrieval-augmented generation (RAG) applications, focusing on the retrieval step to ensure relevant data is fetched.
The Document object in LangChain serves as a container for data, with two key attributes: page_content and metadata.
The LangChain document loader handles various document types, such as HTML, PDF, and code, from various locations.
LangChain retrieves relevant sections from documents by splitting them into manageable pieces.
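The splitting step can be sketched as a character-level chunker with overlap (a simplified stand-in for LangChain's text splitters; the chunk sizes are arbitrary):

```python
def split_text(text, chunk_size=40, overlap=10):
    # Split text into fixed-size character chunks with overlap, so that
    # context is preserved across chunk boundaries. LangChain's splitters
    # do this more carefully, respecting sentence and token boundaries.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

doc = "LangChain splits long documents into overlapping chunks for retrieval."
for chunk in split_text(doc):
    print(repr(chunk))
```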
LangChain embeds documents and facilitates various retrievers.
LangChain is a platform that provides APIs for developing LLM applications.
A chain in LangChain is a sequence of calls in which the output from one step becomes the input for the next step.
To build a chain in LangChain, you first define a template string for the prompt, then create a PromptTemplate from that template, and finally create an LLMChain object from the template and the model.
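The template-then-chain pattern can be mimicked in plain Python (SimpleChain and fake_llm are hypothetical stand-ins for LLMChain and a real model call):

```python
# A pared-down imitation of the PromptTemplate -> LLM chain pattern.
# The "LLM" here is a placeholder function, not a real model call.
def fake_llm(prompt):
    return f"[model answer to: {prompt}]"

class SimpleChain:
    def __init__(self, template, llm):
        self.template = template   # string with named placeholders
        self.llm = llm

    def run(self, **kwargs):
        prompt = self.template.format(**kwargs)  # fill the template
        return self.llm(prompt)                  # pass it to the model

chain = SimpleChain("Recommend a popular dish in {place}.", fake_llm)
print(chain.run(place="Naples"))
```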
In LangChain, memory is important for reading and writing historical conversation data.
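A minimal sketch of such memory, modeled loosely on LangChain's ChatMessageHistory (the ChatHistory class here is a simplified stand-in):

```python
class ChatHistory:
    # Lightweight sketch of a chat message history: stores turns so that
    # later prompts can include earlier context.
    def __init__(self):
        self.messages = []

    def add_user_message(self, text):
        self.messages.append(("human", text))

    def add_ai_message(self, text):
        self.messages.append(("ai", text))

history = ChatHistory()
history.add_user_message("Hi!")
history.add_ai_message("Hello! How can I help?")
print(len(history.messages))  # → 2
```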
Agents in LangChain are dynamic systems in which a language model determines and sequences actions, in contrast to the fixed sequences of predefined chains.
Agents integrate with tools such as search engines, databases, and websites to fulfill user requests.
In the standard architecture of a retrieval-augmented generation (RAG) system, the pipeline is:
Chunk → Embed → Index → Retrieve → Generate
Why this is the correct pipeline
This sequence represents the complete lifecycle of data in a RAG system, from raw text to a final AI answer.
Chunk (Segmentation): Large knowledge documents (like PDFs or manuals) are split into smaller, manageable pieces called "chunks" (e.g., 300–500 words). This ensures that the context passed to the AI is precise and fits within its processing limits.
Embed (Vectorization): These text chunks are passed through an embedding model, which converts them into numerical vectors (lists of numbers) that represent the semantic meaning of the text.
Index (Storage): The resulting vectors are stored in a vector database (or index) to allow for ultra-fast similarity searching later.
Retrieve: When a user asks a question, the system searches the index to find the vectors most similar to the user's query.
Generate: The retrieved text chunks are combined with the user's question and sent to the Large Language Model (LLM) to generate a final, accurate response.
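The five steps above can be sketched end to end as a toy pipeline; the bag-of-words "embedding" and string-stitching "generation" are deliberate simplifications of a real embedding model and LLM:

```python
# End-to-end toy of the Chunk -> Embed -> Index -> Retrieve -> Generate
# pipeline. Real systems swap in an embedding model, a vector database,
# and an LLM for each simplified step below.
def embed(text):
    # Bag-of-words count vector: a crude stand-in for a semantic embedding.
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def similarity(a, b):
    # Dot product over shared words: a rough semantic-overlap score.
    return sum(a[w] * b.get(w, 0) for w in a)

def rag_answer(question, chunks):
    index = [embed(c) for c in chunks]   # Embed + Index
    q = embed(question)
    best = max(range(len(chunks)), key=lambda i: similarity(q, index[i]))  # Retrieve
    context = chunks[best]
    return f"Based on: '{context}' -> answer to '{question}'"  # Generate

chunks = [  # documents already Chunked
    "Employees may use mobile phones for work calls only.",
    "Vacation requests need two weeks notice.",
]
print(rag_answer("What is the mobile phone policy?", chunks))
```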
Glossary: Fundamentals of Building AI Agents using RAG and LangChain:
| Term | Definition |
|---|---|
| Bidirectional and Auto-Regressive Transformers (BART) | Sequence-to-sequence large language model (LLM) that follows an encoder-decoder architecture. It leverages encoding for contextual understanding and decoding to generate text. |
| Bidirectional Encoder Representations from Transformers (BERT) | An open-source, deeply bidirectional, unsupervised language representation pretrained using a plain text corpus. |
| Bradley-Terry model | A probability model for the outcome of pairwise comparisons between items, teams, or objects. |
| Chain-of-thought (CoT) | An AI technique that simulates human-like reasoning by breaking down complex tasks into logical steps. |
| Chat model | A model designed for efficient conversations: it understands questions or prompts and responds to them like a human. |
| Context encoder | The DPR component that encodes candidate passages or documents into dense vector representations for retrieval. |
| Contextual embeddings | A type of embedding that aptly describes how the transformer processes the input word embeddings by accounting for the context in which each word occurs within the sequence. |
| Data leakage | The unintended exposure of sensitive information, a challenge organizations face when deploying LLMs. |
| Dense Passage Retrieval (DPR) | A set of tools that fetches relevant passages with respect to the question asked based on the similarity between the high-quality, low-dimensional continuous representation of passages and questions. |
| Facebook AI Similarity Search (Faiss) | It is a library developed by Facebook AI Research that offers efficient algorithms for searching through large collections of high-dimensional vectors. |
| Faiss index | A data structure that enables efficient similarity search over vectors. |
| Few-shot prompt | A technique where the prompt provides the model with a small number of examples, usually between two and five, to help it adapt to new inputs. |
| Fine-tuning | A supervised process that optimizes the initially trained GPT model for specific tasks, like QA classification. |
| Generative pre-trained transformer (GPT) | A self-supervised model that involves training a decoder to predict the subsequent token or word in a sequence. |
| GitHub | A developer platform to create, store, manage, and share codes. |
| Graphics processing unit (GPU) | A processor that renders graphics smoothly and accelerates parallel computation. |
| Hugging Face | Platform that offers an open-source library with pretrained models and tools to streamline the process of training and fine-tuning generative AI models. |
| In-Context learning | A technique in which task demonstrations are integrated into the prompt in a natural language format. |
| LangChain | An open-source interface that simplifies the application development process using LLMs. It facilitates a structured way to integrate language models into various use cases, including natural language processing or NLP and data retrieval. |
| LangChain-Core | The package containing the LangChain Expression Language and the base abstractions. |
| LangChain chains | Sequences of calls in which each step's output feeds the next. |
| Language model | A model that predicts words by analyzing the previous text, where context length acts as a hyperparameter. |
| Large language models (LLMs) | Foundation models that use AI and deep learning with vast data sets to generate text, translate languages, and create various types of content. They are called large language models due to the size of the training data set and the number of parameters. |
| Machine learning | A data analysis method for automating analytical model building. |
| Model inference | The operationalization of a trained ML model, that is, using it to make predictions on new data. |
| Natural language processing (NLP) | The subfield of artificial intelligence (AI) that deals with the interaction of computers and humans in human language. It involves creating algorithms and models that will help computers understand and comprehend human language and generate contextually relevant text in human language. |
| Prompt engineering | A process of creating effective prompts to enable AI models to generate responses based on the given inputs. |
| Prompt template | A predefined structure or a format that can be filled with specific content to generate prompts. |
| Python | A programming language. |
| PyTorch | A software-based open-source deep learning framework used to build neural networks, combining Torch's machine learning library with a Python-based high-level API. |
| PyTorch tensors | A fundamental data structure that is useful to represent a multi-dimensional array. |
| Retrieval-augmented generation (RAG) | RAG is an AI framework that helps optimize the output of large language models or LLMs. RAG uses the capabilities of LLMs in specific domains or the internal database of an organization without retraining the model. |
| Scoring function | A function that measures the quality of a point prediction for evaluation purposes. |
| Self-consistency | A technique for enhancing the reliability and accuracy of outputs. |
| Tokenization | The process of converting the words in the prompt into tokens. |
| Text classifier | A machine learning technique that assigns a set of predefined categories to open-ended text. |
| Vector averaging | The process of calculating the mean vector from a set of vectors. |
| watsonx.ai | A platform that allows developers to leverage a wide range of large language models (LLMs) under IBM's own series. |
| WatsonxLLM | A wrapper of IBM watsonx.ai foundation models. |
| Zero-shot prompt | A prompt in natural language processing (NLP) where a model can generate results for tasks that have not been trained explicitly. |
Cheat Sheet: Fundamentals of Building AI Agents using RAG and LangChain
| Package/Method | Description |
|---|---|
| Generate text | Generates text sequences based on the input without computing gradients. |
| formatting_prompts_func_no_response function | Generates formatted text prompts from a data set using the instructions it contains, creating strings that include only the instruction and a placeholder for the response. |
| torch.no_grad() | Generates text sequences from the pipeline function while ensuring that gradient computations are disabled, optimizing performance and memory usage. |
| mixtral-8x7b-instruct-v01 watsonx.ai inference model object | Adjusts the parameters to push the limits of creativity and response length. |
| String prompt templates | Used to format a single string; generally used for simpler inputs. |
| Chat prompt templates | Used to format a list of messages. These "templates" consist of a list of templates themselves. |
| MessagesPlaceholder | A prompt template responsible for adding a list of messages in a particular place, useful when you want the user to pass in a list of messages to be slotted into a particular spot. |
| Example selector | If you have many examples, you may need to select which ones to include in the prompt; the example selector is the class responsible for doing so. |
| JSON parser | An output parser that allows users to specify an arbitrary JSON schema and query LLMs for outputs that conform to that schema. |
| Comma-separated list parser | An output parser used when you want to return a list of comma-separated items. |
| Document object | Contains information about some data in LangChain. It has two attributes: page_content (str), which holds the content of the document, and metadata (dict), which contains arbitrary metadata associated with the document, such as the document ID or file name. |
| text_splitter | At a high level, text splitters work as follows: split the text into small, semantically meaningful chunks (often sentences); combine these small chunks into a larger chunk until a certain size is reached (as measured by some function); once that size is reached, make that chunk its own piece of text and start a new chunk with some overlap (to keep context between chunks). |
| Embedding models | Specifically designed to interface with text embeddings. Embeddings generate a vector representation for a given piece of text, which allows you to conceptualize text within a vector space and perform operations such as semantic search, identifying pieces of text that are most similar within the vector space. |
| Vector store-backed retriever | A retriever that uses a vector store to retrieve documents. It is a lightweight wrapper around the vector store class that makes it conform to the retriever interface, using the search methods implemented by the vector store, such as similarity search and MMR (maximal marginal relevance), to query the texts. Given a constructed vector store, it is very easy to construct a retriever. |
| ChatMessageHistory class | One of the core utility classes underpinning most (if not all) memory modules. This super-lightweight wrapper provides convenient methods for saving HumanMessages and AIMessages and then fetching them all. |
| langchain.chains | Uses LangChain, a library for building language model applications, to create a chain that generates popular dish recommendations based on specified locations. It also configures model inference settings for further processing. |
| Simple sequential chain | Sequential chains allow the output of one LLM to be used as the input for another. This approach is beneficial for dividing tasks and maintaining the focus of your LLM. |
| load_summarize_chain | Uses the LangChain library to load and run a summarization chain with a specific language model and chain type, applied to web data to print a resulting summary. |
| TextClassifier | Represents a simple text classifier that uses an embedding layer, a hidden linear layer with a ReLU activation, and an output linear layer. The constructor takes the following arguments: num_class, the number of classes to classify, and freeze, whether to freeze the embedding layer. |
| Train the model | Outlines a function to train a machine learning model using PyTorch. The function trains the model over a specified number of epochs, tracks training progress, and evaluates performance on the data set. |
| llm_model | Defines a function 'llm_model' for generating text using a language model from the Mistral AI platform, specifically the 'mixtral-8x7b-instruct-v01' model. The function allows customizing generation parameters and interacts with IBM Watson machine learning services. |
| Zero-shot prompt | Zero-shot learning is crucial for testing a model's ability to apply its pretrained knowledge to new, unseen tasks without additional training. This capability is valuable for gauging the model's generalization skills. |
| One-shot prompt | A one-shot learning example gives the model a single example to guide its translation from English to French. The prompt provides a sample translation pairing, "How is the weather today?" translated to "Comment est le temps aujourd'hui?", to establish the task context and desired format; the model is then tasked with translating a new sentence, "Where is the nearest supermarket?", without further guidance. |
| Few-shot prompt | Classifies emotions using a few-shot learning approach. The prompt includes several examples in which statements are associated with their respective emotions. |
| Chain-of-thought (CoT) prompting | A prompting technique designed to guide the model through a sequence of reasoning steps to solve a problem. It involves structuring the prompt with an instruction such as "Break down each step of your calculation," which encourages the model to include explicit reasoning steps, mimicking human-like problem-solving. |
| Self-consistency | Determines a consistent result for age-related problems by generating multiple responses. The 'params' dictionary specifies the maximum number of tokens to generate. |
| Prompt template | A key concept in LangChain that helps translate user input and parameters into instructions for a language model. It can be used to guide a model's response, helping it understand the context and generate relevant and coherent language-based output. |
| Text summarization | A text summarization agent designed to help summarize the content you provide to the LLM. You can store the content to be summarized in a variable, allowing repeated use of the prompt. |
| Question answering | An agent that enables the LLM to learn from provided content and answer questions based on what it has learned. Occasionally, if the LLM does not have sufficient information, it might generate a speculative answer; to manage this, you specifically instruct it to respond with "Unsure about the answer" if it is uncertain about the correct response. |
| Code generation | An agent designed to generate SQL queries from given descriptions. It interprets the requirements from your input and translates them into executable SQL code. |
| Role playing | Configures the LLM to assume specific roles as defined by you, enabling it to follow predetermined rules and behave like a task-oriented chatbot. |
| class_names | Maps numerical labels to their corresponding textual descriptions for classification tasks. This helps interpret a model's output, where predictions are numerical but should be presented in a more human-readable format. |
| read_and_split_text | Opens a file, reads its contents, and splits the text into individual paragraphs, each representing a section of the company policies. You can also filter out empty paragraphs to clean the data set. |
| encode_contexts | Encodes a list of texts into embeddings using content_tokenizer and context_encoder. It iterates through each text in the input list, tokenizes and encodes it, and appends the pooler_output to the embeddings list. The resulting embeddings are stored in the context_embeddings variable, producing embeddings from text data for various natural language processing (NLP) applications. |
| import faiss | FAISS (Facebook AI Similarity Search) is an efficient library developed by Facebook for similarity search and clustering of dense vectors. It is designed for fast similarity search, which is particularly valuable with large data sets, and is highly suitable for natural language processing tasks where retrieval speed is critical. It effectively handles large volumes of data, maintaining performance as data set sizes increase. |
| search_relevant_contexts | Searches for relevant contexts for a given question. It tokenizes the question with question_tokenizer, encodes it with question_encoder, and searches an index to retrieve the relevant context based on the question embedding. |
| generate_answer_without_context | Generates responses from the entered prompt without requiring additional context. It tokenizes the input questions with the tokenizer, generates output text with the model, and decodes the generated text to obtain the answer. |
| Generating answers with DPR contexts | Answers are generated when the model uses contexts retrieved via DPR, which are expected to enhance the answer's relevance and depth. |
| aggregate_embeddings function | Takes token indices and their corresponding attention masks and uses a BERT model to convert the tokens into word embeddings. It then filters out the embeddings for zero-padded tokens and computes the mean embedding for each sequence, reducing the dimensionality of the data while retaining the most important information. |
| text_to_emb | Designed to convert a list of text strings into their corresponding embeddings using a predefined tokenizer. |
| process_song | Converts both the predefined appropriateness questions and the song lyrics into "RAG embeddings" and measures the similarity between them to determine appropriateness. |
| RAG_QA | Performs question answering using question embeddings and context embeddings. It reshapes the results for processing, sorts the indices in descending order, and prints the top n responses based on the highest dot-product values. |
| model_name_or_path | Sets the model name to 'gpt2' and initializes the tokenizer and model using the GPT-2 model. Special tokens are added for padding, keeping the maximum sequence length at 1024. |
| add_combined_columns | Combines the prompt with the chosen and rejected responses in a data set example, prefixing 'Human:' and 'Assistant:' for clarity. The function modifies each example in the 'train' split of the data set by creating new columns, 'prompt_chosen' and 'prompt_rejected', with the combined text. |
| RetrievalQA | Creates an example of 'RetrievalQA' using a language model and a document retriever. |