
Build an AI app to chat with your PDF using LangChain and OpenAI API


Overview:

  1. Virtual Environment and Dependencies

  2. Import Packages

  3. Define a process text function

  4. Process and extract text from PDF file

  5. Run a Question-Answering Chain with your PDF

  6. Run the app and Sample Results


1. Virtual Environment and Dependencies

Before writing any script, make sure you have everything checked off from this list:

  • Set up an OpenAI API key (if you don't have one already)

  • Follow the app structure as defined

  • Store all your secret keys in the .env file

  • (optional) Create a launch.json file in VS Code for debug mode (a sketch is provided at the end of this section).


Set up an OpenAI API key by following the Quickstart tutorial from OpenAI.


App structure:

pdf_ai_app
|__ .env
|__ main.py
|__ Pipfile

A quick overview of each file and package:

  • streamlit: for our UI

  • .env: file to store our secret keys; in this case, we only need one for our OpenAI API key, as outlined in the OpenAI quickstart

  • main.py: this is where we'll write our script

  • Pipfile: All dependencies to install


Create and store all your secret keys in the .env file; you can insert them in the following format:

OPENAI_API_KEY="your-api-key-here"
# OPENAI_API_KEY="abcdefg"

Activate your virtual environment and install the following packages (I'm using pipenv in this case, but you can use your preferred dependency management system):

pipenv install langchain faiss-cpu langchain-community langchain-openai tiktoken pypdf2 python-dotenv streamlit
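
For the optional debug setup from the checklist above, a minimal sketch of .vscode/launch.json might look like the following. The module/args pattern runs "python -m streamlit run main.py" under the VS Code debugger; adjust the file name to your project, and note that older versions of the Python extension use "python" instead of "debugpy" as the type:

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Debug Streamlit app",
            "type": "debugpy",
            "request": "launch",
            // runs "python -m streamlit run main.py" under the debugger
            "module": "streamlit",
            "args": ["run", "main.py"],
            "justMyCode": true
        }
    ]
}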

2. Import Packages
# -- for .env and secret keys -- #
import os
from dotenv import load_dotenv
load_dotenv()

# -- UI -- #
import streamlit as st

# -- explanations provided below -- #
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores.faiss import FAISS
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain.chains.question_answering import load_qa_chain

The packages above serve the following purpose:

  • PdfReader: extracts the text from the PDF file

  • Text Splitter: splits the text into chunks (for details: text splitter)

  • Vector Store: serves as our knowledge base; the text chunks from the PDF are converted into vectors and stored here (for details: vector store)

  • Embeddings: the embedding model used to convert text to vectors (for details: text embeddings)

  • Chains: a sequence of calls to the language model, tools, etc. (for details: chains)
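
Before moving on, a quick optional sanity check that load_dotenv actually picked up your key (a minimal sketch; the error message is up to you):

# Optional: fail fast if the key wasn't loaded from .env
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY not found; check your .env file")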


3. Define a process text function
def process_text(text):
    # Split text into chunks
    text_splitter = CharacterTextSplitter(
        separator="\n", chunk_size=1000, chunk_overlap=50, length_function=len
    )
    chunks = text_splitter.split_text(text)

    # Convert text chunks into embeddings to create vector index
    embeddings = OpenAIEmbeddings()
    vector_index = FAISS.from_texts(chunks, embeddings)

    return vector_index

The process_text function splits the text extracted from the PDF into chunks, converts the chunks into vector embeddings, and stores them in a vector store that serves as our knowledge base.
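
To build intuition for what the splitter does, here's a small standalone illustration with hypothetical sample text (no API calls involved, so it's safe to run on its own):

# Split ~300 short lines into ~1000-character chunks with 50-character overlap
splitter = CharacterTextSplitter(
    separator="\n", chunk_size=1000, chunk_overlap=50, length_function=len
)
sample_text = "\n".join(f"Sentence {i} about the topic." for i in range(300))
chunks = splitter.split_text(sample_text)
print(len(chunks), max(len(c) for c in chunks))  # several chunks, each at most ~1000 chars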


4. Process and Extract text from PDF File
pdf = st.file_uploader("Upload your PDF File", type="pdf")

if pdf:
    pdf_reader = PdfReader(pdf)

    # Initialize an empty string to store text from PDF
    text = ""

    # Iterate through each page and append the extracted text
    # (extract_text() can return None for pages without selectable text)
    for page in pdf_reader.pages:
        text += page.extract_text() or ""

    vector_index = process_text(text)

We'll use the function we defined earlier to convert the text extracted from the PDF file into vector embeddings and store those embeddings in our vector store.

  • Start by creating a variable called pdf to store the uploaded file

  • Wrap it in a PdfReader object and loop over its pages, appending the extracted text to our text string (a defensive variant of this step is sketched after this list)

  • We'll then use our process_text function defined earlier to process and store these texts in our vector_index; this will serve as the "context" or "knowledge base" for the language model to refer to when asked questions about the uploaded file
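
One defensive addition worth considering (my own suggestion, not part of the original flow): scanned, image-only PDFs yield no selectable text, so extract_text() returns empty strings. A small guard placed just before the process_text call, inside the if pdf: block, avoids indexing an empty document:

    # Stop early if nothing could be extracted (e.g., a scanned/image-only PDF)
    if not text.strip():
        st.warning("No selectable text found in this PDF; it may be a scanned document.")
        st.stop()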


5. Run a Question-Answering Chain with your PDF
query = st.text_input("Ask a question to the PDF File")

if query:
    docs = vector_index.similarity_search(query)
    llm = OpenAI()
    chain = load_qa_chain(llm, chain_type="stuff")
    input_data = {"input_documents": docs, "question": query}
    response = chain.invoke(input=input_data)
    output_text = response.get("output_text")

    # return the answer
    st.write(output_text)

  • Run similarity_search on the vector_index for the given query; under the hood this relies on a distance measure (the most popular ones are cosine similarity, Euclidean distance, and the dot product, but there are others). A sketch for inspecting the retrieved chunks follows this list.

  • Initialize our chain using load_qa_chain, a legacy LangChain chain

  • We'll use the "stuff" chain type: it takes a list of documents, inserts them all into a prompt, and passes that prompt to the LLM

  • We'll pass two key-value pairs into our chain as input: the documents retrieved by similarity_search, which serve as the "context" for the prompt, and the question that was asked

  • The answer provided by the LLM is stored in a dictionary and can be retrieved through the key output_text
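
If you'd like to inspect which chunks were retrieved and how close they are, the LangChain FAISS store also exposes similarity_search_with_score (with the default index the score is an L2 distance, so lower means closer). A quick sketch, placed inside the if query: block:

    # Optional: show the retrieved chunks together with their distance scores
    docs_and_scores = vector_index.similarity_search_with_score(query, k=4)
    for doc, score in docs_and_scores:
        st.write(f"score: {score:.3f}", doc.page_content[:200])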


# -- (optional) print everything returned by the chain -- #
st.write(response)

6. Run the app and Sample Results
# run the following line in your CLI or terminal
streamlit run main.py

For the purpose of this demo, I'm going to use the Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks paper.
You can also use it to ask questions about other types of PDF files. For this demo, I'm using the afternoon tea menu from the Fairmont Olympic Seattle.