Ask questions of your PDFs with Unstructured

Install Unstructured Library

We'll start by installing the Unstructured library, which is essential for ingesting and processing PDF files. The library will allow us to convert PDF documents into a JSON format that includes both metadata and text extraction. For this part of the project, we'll focus on installing the PDF support components.

Reference for full installation details: Unstructured Installation Guide

In [1]:

!pip install "unstructured[pdf]"

Import Libraries

In this section, we import the necessary libraries for our project. We'll use pandas to handle data manipulation, converting our semi-structured JSON data into a structured DataFrame format. This is crucial for storing the data in the SingleStore database later on. Additionally, we'll utilize the OpenAI API for vectorizing text and generating responses, integral components of our RAG system.

In [2]:

import os
import json
import mysql.connector
import pandas as pd
import numpy as np
import openai
from openai.embeddings_utils import get_embedding

Configure OpenAI API and SingleStore Database

Before we proceed, it's important to configure our environment. This involves setting up access to the OpenAI API and the SingleStore cloud database. You'll need to retrieve your OpenAI API key and establish a connection with the SingleStore database. These steps are fundamental for enabling the interaction between our AI models and the database.
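
If these environment variables aren't already defined in your notebook environment, here is a minimal sketch (using only the standard library) to collect them interactively before running the next cell; the variable names match the ones this notebook expects:

import os
from getpass import getpass

# Prompt for any credentials not already present in the environment
for var in ["OPENAI_API_KEY", "SS_HOST", "SS_USERNAME", "SS_PASSWORD", "SS_DATABASE"]:
    if var not in os.environ:
        os.environ[var] = getpass(f"{var}: ")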

In [3]:

# OpenAI API Key
openai.api_key = os.environ["OPENAI_API_KEY"]

# SingleStore DB Connection
host = os.environ["SS_HOST"]
port = 3306
username = os.environ["SS_USERNAME"]
password = os.environ["SS_PASSWORD"]
database = os.environ["SS_DATABASE"]

Unstructured PDF Partition

The PDF Partition step is critical for ingesting and processing the PDF document. Here, we define the filename of the PDF to be processed. We then use the partition_pdf function to segment the PDF document, extracting various elements such as text, images, and tables. The function can execute locally or make a call to a remote inference server, depending on your setup.

Additionally, the chunk_by_title function is used to organize the document into sections based on the presence of titles, with non-text elements being treated as separate sections. The "fast" strategy is applied for quick text extraction, which is suitable for text-heavy PDFs.
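
If the default chunks come out too large or too small for your documents, chunk_by_title also accepts sizing parameters. A sketch, noting that parameter names reflect recent versions of unstructured and may differ in older releases, and that the values below are illustrative:

chunks = chunk_by_title(
    elements,
    max_characters=1500,             # hard cap on chunk size
    new_after_n_chars=1000,          # soft cap: start a new chunk past this size
    combine_text_under_n_chars=250,  # merge very small sections together
)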

In [4]:

pdf_filename = "Employee-Handbook.pdf"

In [5]:

from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title
elements = partition_pdf(pdf_filename,
                         strategy="fast")
chunks = chunk_by_title(elements)

Reformat JSON Output into Structured Dataframe Format

After processing the PDF, we receive output in an unstructured JSON format, which includes valuable metadata about the extracted elements. This metadata enables us to filter and manipulate the document elements based on our requirements. Our next step is to convert this JSON output into a structured DataFrame, which is a more suitable format for storing in the SingleStore DB and for further processing in our RAG system.

Reference for understanding metadata: Unstructured Metadata Documentation
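
To see the JSON shape of a single chunk before flattening it into a DataFrame, each element exposes a to_dict() method; a quick inspection sketch:

import json

# Print the text and metadata of the first chunk as pretty-printed JSON
print(json.dumps(chunks[0].to_dict(), indent=2))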

In [6]:

# Convert JSON output into a Pandas DataFrame
data = []
for c in chunks:
    row = {}
    row['Element Type'] = type(c).__name__
    row['Filename'] = c.metadata.filename
    row['Date Modified'] = c.metadata.last_modified
    row['Filetype'] = c.metadata.filetype
    row['Page Number'] = c.metadata.page_number
    row['text'] = c.text
    data.append(row)

df = pd.DataFrame(data)

# Show the DataFrame
df.head()

Make Connection to SingleStore Database

In this step, we establish a connection to the SingleStore database using the MySQL connector. This connection is needed to create a new table matching the structure of our DataFrame and to upload our data. Because SingleStoreDB Cloud is MySQL wire-compatible, we can use standard MySQL tooling such as mysql-connector-python to manage the data.

In [7]:

# Create connection to S2 Database
cnx = mysql.connector.connect(user=username,
                              password=password,
                              host=host,
                              database=database)
cnx
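
Optionally, you can sanity-check the connection before creating any tables; is_connected() is part of mysql-connector-python:

# Fail fast if the connection was not established
assert cnx.is_connected(), "Could not connect to the SingleStore database"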

In [8]:

# Drop the existing table
drop_cursor = cnx.cursor()
drop_query = "DROP TABLE IF EXISTS unstructured_data;"
drop_cursor.execute(drop_query)
# Create a new table
create_cursor = cnx.cursor()
create_query = ("CREATE TABLE unstructured_data ("
                "element_id INT AUTO_INCREMENT PRIMARY KEY, "
                "element_type VARCHAR(255), "
                "filename VARCHAR(255), "
                "date_modified DATETIME, "
                "filetype VARCHAR(255), "
                "page_number INT, "
                "text TEXT);")
create_cursor.execute(create_query)
cnx.commit()
drop_cursor.close()
create_cursor.close()

In [9]:

cursor = cnx.cursor()
# Loop through the DataFrame and insert each row into the table
for i, row in df.iterrows():
    insert_query = """INSERT INTO unstructured_data (element_type, filename, date_modified, filetype, page_number, text)
                      VALUES (%s, %s, %s, %s, %s, %s);"""
    cursor.execute(insert_query, (row['Element Type'], row['Filename'], row['Date Modified'], row['Filetype'], row['Page Number'], row['text']))
cnx.commit()
cursor.close()

Create Text Embedding in the Table

Next, we enhance our database table by adding a new column for text embeddings. Using OpenAI's get_embedding function, we generate embeddings that measure the relatedness of text strings. These embeddings are particularly useful for search functionality, allowing us to rank results by relevance.

Reference: Understanding Text Embeddings
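
Note that get_embedding comes from the legacy openai<1.0 SDK, which this notebook uses throughout. If you are on openai>=1.0, where embeddings_utils was removed, an equivalent sketch would be:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_embedding_v1(text, model="text-embedding-ada-002"):
    # Return the embedding vector (list of floats) for a single string
    return client.embeddings.create(model=model, input=text).data[0].embedding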

In [10]:

cursor = cnx.cursor(buffered=True)
# Add a new column for text embedding
alter_query = "ALTER TABLE unstructured_data ADD text_embedding TEXT;"
cursor.execute(alter_query)

In [11]:

# Select and embed all text in table
query = "SELECT text FROM unstructured_data;"
cursor.execute(query)
rows = cursor.fetchall()
for i in rows:
    text_embedding = json.dumps(get_embedding(i[0], engine="text-embedding-ada-002"))
    update_query = "UPDATE unstructured_data SET text_embedding = %s WHERE text = %s;"
    data = (text_embedding, i[0])
    cursor.execute(update_query, data)
cnx.commit()
cursor.close()

Run User Query Based on Similarity Score

The retrieval step selects the stored text and embeddings from our database, then scores each document chunk against the user query by taking the dot product of the query embedding with each stored embedding (via numpy's np.dot). The top-5 highest-scoring entries are the ones most relevant to the user's query.

Reference: How the Dot Product Measures Similarity
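
Because OpenAI's ada-002 embeddings are normalized to unit length, the dot product used below is equivalent to cosine similarity. A small sketch with synthetic vectors illustrates this:

import numpy as np

# Two random vectors normalized to unit length, like ada-002 embeddings
a = np.random.rand(1536)
a /= np.linalg.norm(a)
b = np.random.rand(1536)
b /= np.linalg.norm(b)

dot = np.dot(a, b)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(dot, cosine)  # identical for unit-length vectors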

In [12]:

# User query
search_string = "What do the emergency management provisions include?"
search_embedding = get_embedding(search_string, engine="text-embedding-ada-002")
search_embedding_array = np.array(search_embedding)

In [13]:

cursor = cnx.cursor()
# Fetch text, type, filename, and embeddings from the unstructured_data table
query = "SELECT text, element_type, filename, text_embedding FROM unstructured_data;"
cursor.execute(query)
results = cursor.fetchall()
# Compute dot product scores
scores = []
for res in results:
    text = res[0]
    type_ = res[1]
    filename = res[2]
    embedding_str = res[3]
    if embedding_str is not None:
        embedding = json.loads(embedding_str)
        embedding_array = np.array(embedding)
        # Compute dot product for this record
        score = np.dot(search_embedding_array, embedding_array)
        scores.append((text, type_, filename, score))
# Sort by score and take the top 5
top_5 = sorted(scores, key=lambda x: x[3], reverse=True)[:5]
# Close the connection
cursor.close()
cnx.close()
# Display top-k records
top_5

Generate the Answer via OpenAI ChatCompletion

In the final step, we take the top-5 most similar entries retrieved from the database and use them as input for OpenAI's ChatCompletion. The ChatCompletion model is designed for both multi-turn conversations and single-turn tasks. It takes a list of messages as input and returns a model-generated message as output, providing us with a coherent and contextually relevant response based on the retrieved documents.

Reference: OpenAI Chat Completions API Guide
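
As with the embedding call, the cell below uses the legacy openai<1.0 SDK. On openai>=1.0, the same request would look like this sketch:

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You are a useful assistant. Use the assistant's content to answer the user's query. Summarize your answer based on the context."},
        {"role": "assistant", "content": str(top_5)},
        {"role": "user", "content": search_string},
    ],
    temperature=0,
)
print(response.choices[0].message.content)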

In [14]:

if top_5:
    try:
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[
                {"role": "system",
                 "content": "You are a useful assistant. Use the assistant's content to answer the user's query. Summarize your answer based on the context."},
                {"role": "assistant", "content": str(top_5)},
                {"role": "user", "content": search_string},
            ],
            temperature=0
        )
        assistant_message = response['choices'][0]['message']['content']
        print("Assistant's Response:", assistant_message)
    except Exception as e:
        print(f"OpenAI API call failed: {e}")
else:
    print("No relevant documents found.")

Tags

#ingest #pdf #vector #unstructured

License

This Notebook has been released under the Apache 2.0 open source license.