Building an AI Analyst Agent with LangChain and GPT-4

Note: In this article, when ReAct is mentioned, I'm referring to the prompting technique introduced in the paper ReAct: Synergizing Reasoning and Acting in Language Models. [2]


Introduction

We can significantly boost the capabilities of Large Language Models (LLMs) by equipping them with supplementary tools. For example, LLMs can leverage tools to book meetings, generate code, or fetch additional information.


In this article, my goal is to showcase the effectiveness of an agent workflow in the context of data analysis and visualization. The workflow is designed to address queries such as:

create one scatter plot showing the correlation between the length of review_description and reviewer_rating in the dataframe. Tell me if there's a correlation between them or not.

In this example, it correctly answers that there's no correlation between them based on the pattern and displays the plot it has generated for us.


The Analyst Agent isn't limited to generating plots and analyzing data; you can also ask it more general questions about your data. In the example below, we ask the Agent to recommend a book for a Taylor Swift fan. It uses the tools we'll create to look for items related to the topic we're interested in, in this case Taylor Swift, and recommends the most relevant item for the given query.


Agent Workflow

The fundamental concept behind agents involves using a language model to select a sequence of actions. In traditional chains, this sequence of actions is pre-determined and hardcoded within the code. However, in agents, a language model serves as a reasoning engine, dynamically deciding which actions to execute and in what sequence. [1]


The Analyst Agent we will see in this article is provided with a predetermined set of tools and instructions. These instructions guide the Agent on when and how to use the tools based on user input. They also guide it on how to find an answer to a given problem through the ReAct prompting technique, in which an LLM generates both reasoning traces and task-specific actions in an interleaved manner. [2]


An overview of the Agent workflow we will implement in this article


Let's take a look at an example of the agent's thought process when given the query: "Are most reviews for The Ballad of Songbirds and Snakes positive or negative?"

The Agent uses a retriever tool named "books_review_search" to find reviews related to the book title in the query. Once it has relevant reviews, the Agent uses a different tool to find the ratings. Based on its analysis of the ratings, it forms an answer to the query, telling the user whether the reviews are mostly positive or negative.
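
A hypothetical trace of that loop, in the Thought/Action/Observation format we will define later, might look like this (the values are illustrative placeholders, not real output):

Thought: I need to find reviews for the book mentioned in the question.
Action: books_review_search
Action Input: The Ballad of Songbirds and Snakes
Observation: [retrieved review records for the book]
Thought: I can check the ratings of these reviews with code.
Action: python_repl
Action Input: df[df['book_name'] == 'The Ballad of Songbirds and Snakes']['reviewer_rating'].mean()
Observation: [mean rating]
Thought: I now know the final answer
Final Answer: [whether the reviews are mostly positive or negative]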


Dataset

We'll use the 100 best-selling book reviews dataset as the example for this article. The dataset is licensed under Creative Commons 4.0, so it can be used and redistributed for any purpose, and is available here: https://www.kaggle.com/datasets/anshtanwar/top-200-trending-books-with-reviews


It consists of 920 review entries on the top 100 best-selling books; each entry includes the book's name, the review title and description, and the reviewer's rating.
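
To get a feel for the schema before building anything, we can load the CSV with pandas. A minimal sketch (the file path is an assumption; the column names follow the examples used later in this article):

import pandas as pd

# path is an assumption; point it at wherever you saved the Kaggle CSV
df = pd.read_csv("customer_reviews.csv")
print(df.shape)               # (920, ...) per the dataset description
print(df.columns.tolist())    # e.g. book_name, review_title, review_description, reviewer_rating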


Tools

Our Agent will have access to the following tools to create plots and interact with a given CSV document.

  • Retriever Tool: A tool through which the agent can look up an item related to the query by name using vector similarity search. It is used to find the relevant chunks of documents that contain the answer to a given query.

  • Python REPL Tool: A tool the LLM can use to execute Python code within the agent.


Defining Tools

We will now define the tools our agent can use to interact with the CSV document. We will start with the retriever tool, which is designed to fetch relevant information about books.


# imports (assuming the classic langchain package layout)
import pandas as pd
from langchain.document_loaders import DataFrameLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.tools.retriever import create_retriever_tool
from langchain.vectorstores import FAISS

# configurations
PAGE_CONTENT_COL = "book_name"
CONTENT_COL_SEARCH = "books_review_search"
embedding_model = OpenAIEmbeddings()

# load as df
df = pd.read_csv(uploaded_file)

# load as vector store: each row becomes a document whose page content is
# the book name and whose remaining columns are stored as metadata
loader = DataFrameLoader(df, page_content_column=PAGE_CONTENT_COL)
docs = loader.load()

vectorstore = FAISS.from_documents(docs, embedding_model)

# create retriever tool
retriever_tool = create_retriever_tool(
    vectorstore.as_retriever(),
    name=CONTENT_COL_SEARCH,
    description="Search for customer reviews",
)

To create our retriever tool for the CSV document, we will first set two variables:

  • page content column (PAGE_CONTENT_COL): The name of the column containing the page content. In our dataset, the column used to look up books is called "book_name".

  • name of the tool (CONTENT_COL_SEARCH): The name of the retriever tool that will be passed to our instructions prompt for the LLM. We'll name our retriever tool "books_review_search".


In the code above, we first load a dataset of customer book reviews stored in a CSV file as a pandas dataframe. We then create our vector index using FAISS and OpenAI embeddings, converting the unstructured book names and reviews into vectors to facilitate information retrieval. Finally, we build our retriever tool, which allows our agent to search for reviews based on a book's title.
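
Before wiring the retriever into the agent, it can be useful to sanity-check it directly. A minimal sketch (the query and the metadata key are illustrative):

# hypothetical smoke test: fetch records whose book name is similar to the query
docs = vectorstore.as_retriever().get_relevant_documents(
    "The Ballad of Songbirds and Snakes"
)
for doc in docs[:3]:
    # page_content holds the book name; the remaining columns live in metadata
    print(doc.page_content, doc.metadata.get("reviewer_rating"))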


Next, we will create our Python code executor tool, which functions as a Python interpreter within the agent.


# Create REPL tool (imports assume langchain_experimental and pydantic v1)
from langchain_experimental.tools import PythonAstREPLTool
from pydantic import BaseModel, Field

class PythonInputs(BaseModel):
    query: str = Field(description="code snippet to run")

repl = PythonAstREPLTool(
    locals={"df": df},
    name="python_repl",
    description="Runs code and returns the output of the final line",
    args_schema=PythonInputs,
)

# the list of tools our agent will receive later
tools = [retriever_tool, repl]

In the code above, we created a Python REPL (Read-Eval-Print Loop) tool named "python_repl" within the agent, allowing the agent to execute Python code snippets and interact with data directly. It accepts code as input, provides access to the dataframe we loaded earlier as "df", and returns the output/return value of that code.
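
We can also try the REPL tool on its own before handing it to the agent. A quick check might look like this (the snippets are illustrative and assume the reviewer_rating column from our dataset):

# the tool evaluates the snippet and returns the value of the final expression
print(repl.run("df.shape"))
print(repl.run("df['reviewer_rating'].mean()"))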


ReAct and CSV Prompts for the LLM Agent

In this implementation, I slightly modified two existing prompts: the hwchase17/react prompt available in the LangChain Hub and the csv-agent prompt available in the LangChain CSV Agent Template. The implementation combines the two prompt templates (react_template and csv_template) to provide instructions for an agent working with a CSV file:

  • Data Understanding: The csv_template first helps the agent understand the data by explaining when and how to interact with the data using the available tools we created earlier along with few-shot examples.

  • Reasoning-and-Acting: The react_template defines a question-and-answer format using a loop of Thought, Action, Input, and Observation to help the agent gather information and arrive at a final answer to the original question.


The combination of the two prompts equips the agent with a structured approach to work with the CSV file and answer questions using its available tools.

csv_template = """You are working with a pandas dataframe in Python. The name of the dataframe is `df`.
It is important to understand the attributes of the dataframe before working with it. This is the result of running `df.head().to_markdown()`

<df>
{dhead}
</df>

You are not meant to use only these rows to answer questions - they are meant as a way of telling you about the shape and schema of the dataframe.
You also do not have to use only the information here to answer questions - you can run intermediate queries to do exploratory data analysis to give you more information as needed.

You have a tool called {CONTENT_COL_SEARCH} through which you can look up a book by name and find the records corresponding to reviews whose book name is similar to the query.
You should only really use this if your search term contains a book's name. Otherwise, try to solve it with code.
The dataset can be accessed with a variable named 'df'.
Remember to import necessary libraries if needed.

For example:

<question>What is the review for the book How to Catch a Turkey?</question>
<logic>Use {CONTENT_COL_SEARCH} since you can use the query `How to Catch a Turkey`</logic>

<question>Which book name has the highest mean reviewer rating?</question>
<logic>Use `python_repl` since even though the question is about a book, you don't know its name so you can't include it.</logic>
"""

react_template = """

{instructions}

TOOLS:
------
You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}
"""

To combine the two prompts, I pass the formatted csv_template as the instructions variable of the react prompt with the following code:

from langchain.prompts import PromptTemplate

instructions = csv_template.format(
    dhead=df.head().to_markdown(),
    CONTENT_COL_SEARCH=CONTENT_COL_SEARCH,
)
base_prompt = PromptTemplate.from_template(react_template)
full_prompt = base_prompt.partial(instructions=instructions)
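
At this point, instructions is already bound by .partial(), so the only variables left unfilled in full_prompt are the ones supplied at runtime. A quick way to verify this:

# remaining variables are filled by create_react_agent and the executor
print(full_prompt.input_variables)
# expected: ['agent_scratchpad', 'input', 'tool_names', 'tools'] (order may vary)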

Putting it all together for the Agent

Finally, we will initialize our react agent with the full prompt and the tools we created earlier, along with a language model to power the agent, in this case GPT-4.


Creating the react agent

When creating the agent, it takes in several arguments: the LLM, the tools, and the prompt we set up earlier. The prompt acts as a template guiding the agent's interaction format.


Recall that we didn't use the raw template directly; instead, we created a full_prompt by partially filling the base prompt with information specific to our task: how to handle queries related to CSV documents and which tools to use with them.

The create_react_agent function then connects these elements:

  • Maintains an "agent_scratchpad" to store the history of the agent's actions and tool outputs.

  • Connects the "full_prompt" to the LLM, defining how the LLM interacts with the prompt and tools based on the instructions and references provided.

  • Finally, it adds an output parser to interpret the LLM's response according to the ReAct prompting format.

  • We will also set streaming=True on the LLM to stream the intermediate steps the agent takes for our chat interface later. [3]

After assembling these components, it returns a complete "Runnable" agent object. This object, especially when configured with 'streaming=True', can be used to stream intermediate steps for a chat interface, allowing for a more interactive experience.


from langchain.agents import create_react_agent
from langchain.chat_models import ChatOpenAI

GPT_MODEL = "gpt-4"

react_agent = create_react_agent(
    llm=ChatOpenAI(temperature=0, model=GPT_MODEL, streaming=True),
    tools=tools,
    prompt=full_prompt,
)

After creating our react agent, we wrap it in an AgentExecutor instance, which acts as a manager for the react agent. It controls the execution loop in which the agent interacts with the environment through the provided tools. In a nutshell, here's what it does:

  • The agent takes a step using its internal logic and prompt/observations.

  • The AgentExecutor uses the chosen action and interacts with the appropriate tool from the tools list.

  • The tool's response becomes the next observation for the agent.

  • It repeats the previous steps until the agent finishes or the AgentExecutor stops based on pre-defined conditions (max steps or time limit).


from langchain.agents import AgentExecutor

agent_executor = AgentExecutor(
    agent=react_agent,
    tools=tools,
    verbose=True,
    return_intermediate_steps=True,
    handle_parsing_errors=True,
)

In the implementation above, we return the agent's thought process by setting return_intermediate_steps to True. This gives us access to the thought process and not just the final output, which is helpful for gaining visibility into what the agent is doing and for extracting information not visible in the final output, for example, the code the agent generated to arrive at its final answer.


Note: setting verbose=True will log the intermediate steps to the CLI, and setting handle_parsing_errors=True sends parsing errors back as observations for the agent to potentially fix.
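
Since return_intermediate_steps is enabled, we can inspect each step after a run. A minimal sketch (the query is illustrative):

response = agent_executor.invoke(
    {"input": "Which book name has the highest mean reviewer rating?"}
)
# each step is an (AgentAction, observation) tuple
for action, observation in response["intermediate_steps"]:
    print(f"tool: {action.tool}")
    print(f"tool_input: {action.tool_input}")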


Streamlit Chat Interface

Now that we have set everything up for our Analyst Agent, we can connect it with Streamlit to visualize the agent's response. Apart from returning the final output of the agent, we also want to:

  • Visualize and stream the agent's thought process

  • Display the plot created by the agent, if any.


To display the agent's thought process, we will use the StreamlitCallbackHandler. It creates a section (expander) for each LLM thought, renders its output, uses labels to indicate each stage (start, complete), and collapses older thoughts.


import streamlit as st
from langchain.callbacks import StreamlitCallbackHandler

query = st.chat_input("Ask an AI Analyst about your data")
if query:
    st.chat_message("user").write(query)

    with st.chat_message("assistant"):
        st_callback = StreamlitCallbackHandler(st.container())
        response = agent_executor.invoke(
            {"input": query}, {"callbacks": [st_callback]}
        )

As Streamlit handles plots differently in the chat interface, we have to find a way to display the plots in the Streamlit UI. This can be done by extracting the code blocks from the agent's intermediate_steps and running them in the chat interface whenever they are detected. To do that, we will create two helper functions that do the following:

  • Find and extract the code block in the agent's thought process

  • Execute the code and render the resulting plot in the Streamlit interface


Before creating our helper function to find the code block, we have to know what the generated response looks like. This can be done by printing the response: when the agent generates code, it returns a dictionary in the following format, with the code stored under the tool_input attribute of the first element in the intermediate_steps list.



{
    'input': '...',
    'output': '...',
    'intermediate_steps': [(AgentAction(tool='...', tool_input='...', log='...'),
                            'error message will go here (if any)')]
}

We can then extract the part of the response where the code is stored. The agent's response structure may vary depending on the tool used. To handle these potential variations, we first try a potentially nested structure and then fall back to handling a simpler format if we encounter one; if no tools were used at all, we return the final output from the response.

def extract_tool_and_input(response):
    try:
        if len(response.get("intermediate_steps", [])) < 1:
            raise ValueError("No intermediate steps found")

        try:
            # Prioritize the nested structure: an (AgentAction, observation) tuple
            tool_data = response["intermediate_steps"][0][0]
            tool = tool_data.tool if hasattr(tool_data, "tool") else None
            tool_input = tool_data.tool_input if hasattr(tool_data, "tool_input") else None
        except (IndexError, AttributeError) as e:
            # Fall back to a non-nested structure: the AgentAction itself
            tool_data = response["intermediate_steps"][0]
            tool = tool_data.tool if hasattr(tool_data, "tool") else None
            tool_input = tool_data.tool_input if hasattr(tool_data, "tool_input") else None
            print(f"Falling back to non-nested structure: {str(e)}")

        if tool is None or tool_input is None:
            # No usable tool call; just show the final output
            st.write(response["output"])
            return None, None

        return tool, tool_input

    except (IndexError, AttributeError, ValueError) as e:
        print(f"Error processing response: {str(e)}")
        return None, None

# Usage:
tool, tool_input = extract_tool_and_input(response)

Find and extract code block

Once we find the part of the response where the code is located, we can extract the code block from it.


Let's first take a look at what tool_input might look like:

tool_input="```python\nimport pandas as pd\nimport matplotlib.pyplot as plt\n\n# Convert 'date' column to datetime format\ndf['date'] = pd.to_datetime(df['date'])\n\n# Group by date and count the number of reviews\nreviews_over_time = df.groupby('date').size()\n\n# Plot the data\nplt.figure(figsize=(10,6))\nreviews_over_time.plot(kind='line')\nplt.title('Number of Reviews Over Time')\nplt.xlabel('Date')\nplt.ylabel('Number of Reviews')\nplt.show()\n```"

We know from looking at the output above that the code is stored within triple backticks, preceded by the word python.

To extract this, we can use a regular expression to locate code sections enclosed within triple backticks. The pattern also handles the optional word "python" after the opening backticks.


To simplify extraction, we will use a named capture group called "code_block" to capture the actual code content. Finally, we'll combine all the extracted code blocks into a single string separated by newlines, passing re.DOTALL so the pattern can match code that spans multiple lines.


import re

def extract_code(response):
    # match fenced blocks like ```python ... ``` (the language tag is optional)
    PATTERN = r"```(python)?(?P<code_block>.*?)```"
    code = "\n".join(
        match.group("code_block").strip()
        for match in re.finditer(PATTERN, response, re.DOTALL)
    )
    return code
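
As a quick check, we can run extract_code on a string shaped like the tool_input above (the snippet is illustrative):

sample = "```python\nimport matplotlib.pyplot as plt\nplt.plot([1, 2, 3])\nplt.show()\n```"
print(extract_code(sample))
# import matplotlib.pyplot as plt
# plt.plot([1, 2, 3])
# plt.show()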

Execute Code and Display the Plot

After extracting the code, we can execute it and render the resulting plot. This is done using exec() while providing access to the dataframe. We then check for plots created with Matplotlib and display them with Streamlit if found.


import matplotlib.pyplot as plt

def execute_code(response, df):
    code = extract_code(response)

    # if there's code in the response, try to execute it
    if code:
        try:
            # run the code in an explicit namespace that exposes the dataframe
            namespace = {"df": df, "pd": pd, "plt": plt}
            exec(code, namespace)
            if plt.gcf().get_axes():
                # get the current figure and display it
                st.pyplot(plt.gcf())
                plt.clf()
            else:
                print("No plot detected")
        except Exception as e:
            st.error(f"{str(e)}")
    else:
        print("No code detected")

Note: It's important to be cautious about potential namespace pollution when using exec(), especially when working in a larger codebase; confining execution to an explicit namespace, as above, limits the blast radius.


Running the Agent

Now that we have everything set up, we can put everything together and have our Analyst Agent create plots and analyze the result for us.

query = st.chat_input("Ask an AI Analyst about your data")

if query:
    st.chat_message("user").write(query)

    with st.chat_message("assistant"):
        st_callback = StreamlitCallbackHandler(st.container())
        response = agent_executor.invoke(
            {"input": query}, {"callbacks": [st_callback]}
        )

        tool, tool_input = extract_tool_and_input(response)

        if tool == "python_repl":
            try:
                execute_code(tool_input, df)
            except Exception as e:
                print(f"Error executing code: {str(e)}")
        else:
            st.write(response["output"])

In the example below, I asked the Analyst Agent to create one pie chart showing the percentages of "reviewer_rating" in the dataframe. The Agent identified the need for code execution and used the "python_repl" tool; the generated code was then executed using the helper functions we created earlier, and the resulting pie chart was displayed.



It can also answer more general questions you might have about your data, for example: "Can you tell me what this dataset is about?"



Conclusion

In this article, we built an AI Analyst Agent using LangChain, GPT-4, and Streamlit. We provided our agent with two tools that help it run code and look up items for a given query through a vector retriever. We also combined two prompts to help the agent understand how to approach a problem given a question and which tool to use in a given situation.


We used sample book review data by injecting it into a vector store, which we exposed through a retriever tool so the agent can look up information for a given query. We also provided a Python REPL tool to help the agent execute commands and compute answers. Finally, we allowed the chat interface to dynamically execute code when a plot is detected and display the plot to the end user.


With the AI Analyst Agent, we allow end users to interact with data through natural language queries and reduce the need for complex code, making data exploration accessible to a wider audience.


References

[1] LangChain Documentation, Agents (2023).


[2] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629v3 [cs.CL]


[3] LangChain Documentation, Streaming with LangChain (2023).
