Ask questions of your PDFs with Unstructured


SingleStore Notebooks

Ask questions of your PDFs with Unstructured

Install Unstructured Library

We'll start by installing the Unstructured library, which is essential for ingesting and processing PDF files. The library will allow us to convert PDF documents into a JSON format that includes both metadata and text extraction. For this part of the project, we'll focus on installing the PDF support components.

Reference for full installation details: Unstructured Installation Guide

In [1]:

!pip install "unstructured[pdf]"

Import Libraries

In this section, we import the necessary libraries for our project. We'll use pandas to handle data manipulation, converting our semi-structured JSON data into a structured DataFrame format. This is crucial for storing the data in the SingleStore database later on. Additionally, we'll utilize the OpenAI API for vectorizing text and generating responses, integral components of our RAG system.

In [2]:

import os
import json
import mysql.connector
import pandas as pd
import numpy as np
import openai
from openai.embeddings_utils import get_embedding

Configure OpenAI API and SingleStore Database

Before we proceed, it's important to configure our environment. This involves setting up access to the OpenAI API and the SingleStore cloud database. You'll need to retrieve your OpenAI API key and establish a connection with the SingleStore database. These steps are fundamental for enabling the interaction between our AI models and the database.

In [3]:

# OpenAI API Key
openai.api_key = os.environ["OPENAI_API_KEY"]
# SingleStore DB Connection

Unstructured PDF Partition

The PDF Partition step is critical for ingesting and processing the PDF document. Here, we define the filename of the PDF to be processed. We then use the partition_pdf function to segment the PDF document, extracting various elements such as text, images, and tables. The function can execute locally or make a call to a remote inference server, depending on your setup.

Additionally, the chunk_by_title function is used to organize the document into sections based on the presence of titles, with non-text elements being treated as separate sections. The "fast" strategy is applied for quick text extraction, which is suitable for text-heavy PDFs.


In [4]:

pdf_filename = "Employee-Handbook.pdf"

In [5]:

from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title
elements = partition_pdf(pdf_filename,
chunks = chunk_by_title(elements)

Reformat JSON Output into Structured Dataframe Format

After processing the PDF, we receive output in an unstructured JSON format, which includes valuable metadata about the extracted elements. This metadata enables us to filter and manipulate the document elements based on our requirements. Our next step is to convert this JSON output into a structured DataFrame, which is a more suitable format for storing in the SingleStore DB and for further processing in our RAG system.

Reference for understanding metadata: Unstructured Metadata Documentation

In [6]:

# Convert JSON output into Pandas DataFrame
data = []
for c in chunks:
row = {}
row['Element Type'] = type(c).__name__
row['Filename'] = c.metadata.filename
row['Date Modified'] = c.metadata.last_modified
row['Filetype'] = c.metadata.filetype
row['Page Number'] = c.metadata.page_number
row['text'] = c.text
df = pd.DataFrame(data)
# Show the DataFrame

Make Connection to SingleStore Database

In this step, we establish a connection to the SingleStore Database using the MySQL connector. This connection is vital for creating a new table that matches the structure of our DataFrame and for uploading our data. SingleStoreDB Cloud's compatibility with MySQL allows us to leverage its tools for managing data and executing data-related tasks efficiently.


In [7]:

# Create connection to S2 Database
cnx = mysql.connector.connect(user=username,

In [8]:

# Drop the existing table
drop_cursor = cnx.cursor()
drop_query = "DROP TABLE IF EXISTS unstructured_data;"
# Create a new table
create_cursor = cnx.cursor()
create_query = ("CREATE TABLE unstructured_data ("
"element_type VARCHAR(255), "
"filename VARCHAR(255), "
"date_modified DATETIME, "
"filetype VARCHAR(255), "
"page_number INT, "
"text TEXT);")

In [9]:

cursor = cnx.cursor()
# Loop through the DataFrame and insert each row into the table
for i, row in df.iterrows():
insert_query = """INSERT INTO unstructured_data (element_type, filename, date_modified, filetype, page_number, text)
VALUES (%s, %s, %s, %s, %s, %s);"""
cursor.execute(insert_query, (row['Element Type'], row['Filename'], row['Date Modified'], row['Filetype'], row['Page Number'], row['text']))

Create Text Embedding in the Table

Next, we enhance our database table by adding a new column for text embeddings. Using OpenAI's get_embedding function, we generate embeddings that measure the relatedness of text strings. These embeddings are particularly useful for search functionality, allowing us to rank results by relevance.

Reference: Understanding Text Embeddings

In [10]:

cursor = cnx.cursor(buffered=True)
# Add a new column for text embedding
alter_query = "ALTER TABLE unstructured_data ADD text_embedding TEXT;"

In [11]:

# Select and embed all text in table
query = "SELECT text FROM unstructured_data;"
rows = cursor.fetchall()
for i in rows:
text_embedding = json.dumps(get_embedding(i[0], engine="text-embedding-ada-002"))
update_query = ("UPDATE unstructured_data SET text_embedding = %s WHERE text = %s;")
data = (text_embedding, i[0])
cursor.execute(update_query, data)

Run User Query Based on Similarity Score

The retrieval process begins by selecting the table and text embeddings from our database. We then calculate similarity scores using numpy's dot product function, comparing the user query embeddings with the document embeddings. This allows us to identify and select the top-5 most similar entries, which are most relevant to the user's query.

Reference: How the Dot Product Measures Similarity

In [12]:

# User query
search_string = "What are the emergency management provisions include?"
search_embedding = get_embedding(search_string, engine="text-embedding-ada-002")
search_embedding_array = np.array(search_embedding)

In [13]:

cursor = cnx.cursor()
# Fetch text, type, filename, and embeddings from the unstructured_data table
query = "SELECT text, element_type, filename, text_embedding FROM unstructured_data;"
results = cursor.fetchall()
# Compute dot product scores
scores = []
for res in results:
text = res[0]
type_ = res[1]
filename = res[2]
embedding_str = res[3]
if embedding_str is not None:
embedding = json.loads(embedding_str)
embedding_array = np.array(embedding)
# Compute dot product for all records
score =, embedding_array)
scores.append((text, type_, filename, score))
# Sort by score and take the top 5
top_5 = sorted(scores, key=lambda x: x[3], reverse=True)[:5]
# Close the connection
# Display top-k records

Generate the Answer via OpenAI ChatCompletion

In the final step, we take the top-5 most similar entries retrieved from the database and use them as input for OpenAI's ChatCompletion. The ChatCompletion model is designed for both multi-turn conversations and single-turn tasks. It takes a list of messages as input and returns a model-generated message as output, providing us with a coherent and contextually relevant response based on the retrieved documents.

Reference: OpenAI Chat Completions API Guide

In [14]:

if top_5:
response = openai.ChatCompletion.create(
{"role": "system",
"content": "You are a useful assistant. Use the assistant's content to answer the user's query. Summarize your answer based on the context."
{"role": "assistant", "content": str(top_5)},
{"role": "user", "content": search_string},
assistant_message = response['choices'][0]['message']['content']
print("Assistant's Response:", assistant_message)
except Exception as e:
print(f"OpenAI API call failed: {e}")
print("No relevant documents found.")





This Notebook has been released under the Apache 2.0 open source license.