
Build an AI app to chat with your PDF using LangChain and OpenAI API


Overview:

  1. Virtual Environment and Dependencies

  2. Import Packages

  3. Define a process text function

  4. Process and extract text from PDF file

  5. Run a Question-Answering Chain with your PDF

  6. Run the app and Sample Results


1. Virtual Environment and Dependencies

Before writing any script, make sure you have everything checked off from this list:

  • Set up an OpenAI API key (if you don't have one already)

  • Follow the app structure as defined

  • Store all your secret keys in the .env file

  • (optional) Create a launch.json file in VS Code for debug mode (a sketch is provided at the end of this section).


Set up an OpenAI API key by following the Quickstart tutorial from OpenAI.


App structure:

pdf_ai_app
|__ .env
|__ main.py
|__ Pipfile

A quick overview of each file and package:

  • streamlit: for our UI

  • .env: file to store our secret keys; in this case, we only need one for our OpenAI API key, as outlined in the OpenAI quickstart

  • main.py: this is where we'll write our script

  • Pipfile: All dependencies to install


Create and store all your secret keys in the .env file; you can insert them in the following format:

OPENAI_API_KEY="your-api-key-here"
# OPENAI_API_KEY="abcdefg"

Activate your virtual environment and install the following packages (I'm using pipenv in this case, but you can use your preferred dependency management system):

pipenv install langchain faiss-cpu langchain-community langchain-openai tiktoken pypdf2 python-dotenv streamlit
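
For the optional debug setup from the checklist above, a minimal sketch of .vscode/launch.json might look like the following. The module/args pattern runs "python -m streamlit run main.py" under the VS Code debugger; adjust the file name to your project, and note that older versions of the Python extension use "python" instead of "debugpy" as the type:

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Debug Streamlit app",
            "type": "debugpy",
            "request": "launch",
            // runs "python -m streamlit run main.py" under the debugger
            "module": "streamlit",
            "args": ["run", "main.py"],
            "justMyCode": true
        }
    ]
}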

2. Import Packages
# -- for .env and secret keys -- #
import os
from dotenv import load_dotenv
load_dotenv()

# -- UI -- #
import streamlit as st

# -- explanations provided below -- #
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores.faiss import FAISS
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain.chains.question_answering import load_qa_chain

The packages above serve the following purpose:

  • PdfReader: extracts the text from the PDF file

  • Text Splitter: splits the text into chunks (for details: text splitter)

  • Vector Store: serves as our knowledge base; the text chunks from the PDF are converted into vectors and stored here (for details: vector store)

  • Embeddings: the embedding model used to convert text to vectors (for details: text embeddings)

  • Chains: a sequence of calls to the language model, tools, etc. (for details: chains)
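
Before moving on, a quick optional sanity check that load_dotenv actually picked up your key (a minimal sketch; the error message is up to you):

# Optional: fail fast if the key wasn't loaded from .env
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY not found; check your .env file")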


3. Define a process text function
def process_text(text):
    # Split text into chunks
    text_splitter = CharacterTextSplitter(
        separator="\n", chunk_size=1000, chunk_overlap=50, length_function=len
    )
    chunks = text_splitter.split_text(text)

    # Convert text chunks into embeddings to create vector index
    embeddings = OpenAIEmbeddings()
    vector_index = FAISS.from_texts(chunks, embeddings)

    return vector_index

The process_text function splits the text extracted from the PDF into chunks, converts the chunks into vector embeddings, and stores them in a vector store that serves as our knowledge base.
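
To build intuition for what the splitter does, here's a small standalone illustration with hypothetical sample text (no API calls involved, so it's safe to run on its own):

# Split ~300 short lines into ~1000-character chunks with 50-character overlap
splitter = CharacterTextSplitter(
    separator="\n", chunk_size=1000, chunk_overlap=50, length_function=len
)
sample_text = "\n".join(f"Sentence {i} about the topic." for i in range(300))
chunks = splitter.split_text(sample_text)
print(len(chunks), max(len(c) for c in chunks))  # several chunks, each at most ~1000 chars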


4. Process and Extract text from PDF File
pdf = st.file_uploader("Upload your PDF File", type="pdf")

if pdf:
    pdf_reader = PdfReader(pdf)

    # Initialize an empty string to store text from PDF
    text = ""

    # Iterate through each page and append the extracted text
    # (extract_text() can return None for pages without selectable text)
    for page in pdf_reader.pages:
        text += page.extract_text() or ""

    vector_index = process_text(text)

We'll use the function we defined earlier to convert the text extracted from the PDF file into vector embeddings and store those embeddings in our vector store.

  • Start by creating a variable called pdf to store the uploaded file

  • Wrap it in a PdfReader object and loop over its pages, appending the extracted text to our text string (a defensive variant of this step is sketched after this list)

  • We'll then use our process_text function defined earlier to process and store these texts in our vector_index; this will serve as the "context" or "knowledge base" for the language model to refer to when asked questions about the uploaded file
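
One defensive addition worth considering (my own suggestion, not part of the original flow): scanned, image-only PDFs yield no selectable text, so extract_text() returns empty strings. A small guard placed just before the process_text call, inside the if pdf: block, avoids indexing an empty document:

    # Stop early if nothing could be extracted (e.g., a scanned/image-only PDF)
    if not text.strip():
        st.warning("No selectable text found in this PDF; it may be a scanned document.")
        st.stop()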


5. Run a Question-Answering Chain with your PDF
query = st.text_input("Ask a question to the PDF File")

if query:
    docs = vector_index.similarity_search(query)
    llm = OpenAI()
    chain = load_qa_chain(llm, chain_type="stuff")
    input_data = {"input_documents": docs, "question": query}
    response = chain.invoke(input=input_data)
    output_text = response.get("output_text")

    # return the answer
    st.write(output_text)

  • Run similarity_search on the vector_index for the given query; under the hood this relies on a distance measure (the most popular ones are cosine similarity, Euclidean distance, and the dot product, but there are others). A sketch for inspecting the retrieved chunks follows this list.

  • Initialize our chain using load_qa_chain, a legacy LangChain chain

  • We'll use the "stuff" chain type: it takes a list of documents, inserts them all into a prompt, and passes that prompt to the LLM

  • We'll pass two key-value pairs into our chain as input: the documents retrieved by similarity_search, which serve as the "context" for the prompt, and the question that was asked

  • The answer provided by the LLM is stored in a dictionary and can be retrieved through the key output_text
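
If you'd like to inspect which chunks were retrieved and how close they are, the LangChain FAISS store also exposes similarity_search_with_score (with the default index the score is an L2 distance, so lower means closer). A quick sketch, placed inside the if query: block:

    # Optional: show the retrieved chunks together with their distance scores
    docs_and_scores = vector_index.similarity_search_with_score(query, k=4)
    for doc, score in docs_and_scores:
        st.write(f"score: {score:.3f}", doc.page_content[:200])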


# -- (optional) print everything returned by the chain -- #
st.write(response)

6. Run the app and Sample Results
# run the following line in your CLI or terminal
streamlit run main.py

For the purpose of this demo, I'm going to use the Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks paper.
You can also use it to ask questions about other types of PDF files. For this demo, I'm using the afternoon tea menu from the Fairmont Olympic Seattle.