GitHub repo for the full code: https://github.com/eyereece/ai-rag-experiments/tree/main/chat-with-your-pdf-ai
Overview:
Virtual Environment and Dependencies
Import Packages
Define a process text function
Process and extract text from PDF file
Run a Question-Answering Chain with your PDF
Run the app and Sample Results
1. Virtual Environment and Dependencies
Before writing any script, make sure you have everything checked off from this list:
Set up an OpenAI API key (if you don't already have one)
Follow the app structure defined below
Store all your secret keys in the .env file
(optional) Create a launch.json file in VS Code for debug mode (a sample configuration follows this list)
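Here's a minimal launch.json sketch for debugging a Streamlit app in VS Code (it assumes your entry file is main.py at the project root; adjust the args if yours differs):
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Streamlit: main.py",
            "type": "debugpy",
            "request": "launch",
            "module": "streamlit",
            "args": ["run", "main.py"]
        }
    ]
}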
Set up an OpenAI API key by following the Quickstart tutorial from OpenAI.
App structure:
pdf_ai_app
|__ .env
|__ main.py
|__ Pipfile
A quick overview of each file and package:
streamlit: for our UI
.env: stores our secret keys; in this case, we only need the OpenAI API key, as outlined in the OpenAI quickstart
main.py: this is where we'll write our script
Pipfile: all dependencies to install
Create and store all your secret keys in the .env file. You can insert them in the following format:
OPENAI_API_KEY="your-api-key-here"
# example: OPENAI_API_KEY="abcdefg"
Activate your virtual environment and install the following packages (I'm using pipenv here, but you can use your preferred dependency management system):
pipenv install langchain faiss-cpu langchain-community langchain-openai tiktoken pypdf2 python-dotenv streamlit
2. Import Packages
# -- for .env and secret keys -- #
import os
from dotenv import load_dotenv
load_dotenv()
# -- UI -- #
import streamlit as st
# -- explanations provided below -- #
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores.faiss import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.chains.question_answering import load_qa_chain
from langchain_openai import OpenAI
# -- initialize the embedding model and LLM -- #
embeddings = OpenAIEmbeddings()
llm = OpenAI()
The packages above serve the following purposes:
PdfReader: extracts text from the PDF file
Text Splitter: splits text into chunks (for details: text splitter)
Vector Store: serves as our knowledge base; the text chunks from the PDF are converted into vectors and stored here (for details: vector store)
Embeddings: the embedding model used to convert text to vectors; a short sketch follows this list (for details: text embeddings)
Chains: a sequence of calls to the language model, tools, etc. (for details: chains)
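To make the embeddings step concrete, here's a minimal sketch showing how a piece of text becomes a vector (the query string is just an example):
vector = embeddings.embed_query("What is retrieval-augmented generation?")
print(len(vector))  # the vector's length depends on the model, e.g. 1536 for OpenAI's default embedding model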
3. Define a process text function
def process_text(text):
    # Split text into chunks
    text_splitter = CharacterTextSplitter(
        separator="\n", chunk_size=1000, chunk_overlap=50, length_function=len
    )
    chunks = text_splitter.split_text(text)

    # Convert text chunks into embeddings to create a vector index
    embeddings = OpenAIEmbeddings()
    vector_index = FAISS.from_texts(chunks, embeddings)
    return vector_index
The process_text function splits the text obtained from the PDF file into chunks, converts them into vector embeddings, and stores them in a vector store that serves as our knowledge base.
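To see what the splitter does, here's a minimal sketch with a deliberately tiny chunk_size and a toy string (both are illustrative values only):
splitter = CharacterTextSplitter(separator="\n", chunk_size=25, chunk_overlap=0, length_function=len)
print(splitter.split_text("first line\nsecond line\nthird line"))
# the text is split on "\n", then the pieces are merged into chunks of up to ~25 characters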
4. Process and Extract Text from the PDF File
pdf = st.file_uploader("Upload your PDF File", type="pdf")

if pdf:
    pdf_reader = PdfReader(pdf)

    # Initialize an empty string to store text from the PDF
    text = ""

    # Iterate through each page and append the extracted text
    for page in pdf_reader.pages:
        text += page.extract_text()

    vector_index = process_text(text)
We'll use the function we defined earlier to convert the text extracted from the PDF file into vector embeddings and store them in our vector store.
Start by creating a variable, pdf, to hold the uploaded file from Streamlit's file_uploader.
Pass it to PdfReader and loop over its pages, appending each page's extracted text to the text string.
We then use the process_text function defined earlier to process and store these texts in vector_index; this serves as the "context" or "knowledge base" the language model refers to when answering questions about the uploaded file.
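One caveat: pages without an extractable text layer (e.g. scanned images) may contribute little or no text. A slightly more defensive version of the loop, as a sketch:
for page in pdf_reader.pages:
    page_text = page.extract_text()
    if page_text:  # skip pages where no text could be extracted
        text += page_text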
5. Run a Question-Answering Chain with your PDF
query = st.text_input("Ask a question to the PDF File")

if query:
    docs = vector_index.similarity_search(query)
    llm = OpenAI()
    chain = load_qa_chain(llm, chain_type="stuff")
    input_data = {"input_documents": docs, "question": query}
    response = chain.invoke(input=input_data)
    output_text = response.get("output_text")

    # display the answer
    st.write(output_text)
Run similarity_search on the vector_index for a given query. Under the hood, similarity is computed with a distance measure or function (the most popular are cosine similarity, Euclidean distance, and the dot product, but there are others)
Initialize our chain using LangChain's legacy load_qa_chain
We'll use the "stuff" chain: it takes a list of documents, inserts them all into a prompt, and passes that prompt to an LLM
We'll pass two key-value pairs into our chain as input: the documents retrieved by similarity_search, which serve as "context" for the prompt, and the question that was asked
The answer provided by the LLM is stored in a dictionary and can be retrieved through the key output_text
# -- (optional) print everything returned by the chain -- #
st.write(response)
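If you'd like to see how close each retrieved chunk is to the query, the FAISS vector store also exposes similarity_search_with_score. A minimal sketch (with the default FAISS index, scores are distances, so lower means closer):
docs_with_scores = vector_index.similarity_search_with_score(query, k=4)
for doc, score in docs_with_scores:
    st.write(score, doc.page_content[:200])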
6. Run the app and Sample Results
# run the following line in your CLI or terminal
streamlit run main.py
For the purpose of this demo, I'm going to use the "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" paper.
You can also use it to ask questions about other types of PDF files. For this demo, I'm using the afternoon tea menu from the Fairmont Olympic Seattle: