Ask questions of your PDFs with Unstructured
Install Unstructured Library
We'll start by installing the Unstructured library, which is essential for ingesting and processing PDF files. The library will allow us to convert PDF documents into a JSON format that includes both metadata and text extraction. For this part of the project, we'll focus on installing the PDF support components.
Reference for full installation details: Unstructured Installation Guide
In [1]:
!pip install "unstructured[pdf]"
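The "fast" strategy used later in this notebook relies only on Python-level PDF parsing, so the pip install above is typically sufficient. If you later switch to the "hi_res" or "ocr_only" strategies, Unstructured also needs system packages such as poppler-utils and tesseract-ocr; on Debian/Ubuntu-based environments (including most hosted notebooks) something like the following usually works, though the exact package names are environment-dependent:

# Optional: system dependencies for the "hi_res"/"ocr_only" strategies (Debian/Ubuntu)
!apt-get install -y poppler-utils tesseract-ocr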
Import Libraries
In this section, we import the necessary libraries for our project. We'll use pandas to handle data manipulation, converting our semi-structured JSON data into a structured DataFrame format. This is crucial for storing the data in the SingleStore database later on. Additionally, we'll utilize the OpenAI API for vectorizing text and generating responses, integral components of our RAG system.
In [2]:
import os
import json
import mysql.connector
import pandas as pd
import numpy as np
import openai
from openai.embeddings_utils import get_embedding
Configure OpenAI API and SingleStore Database
Before we proceed, it's important to configure our environment. This involves setting up access to the OpenAI API and the SingleStore cloud database. You'll need to retrieve your OpenAI API key and establish a connection with the SingleStore database. These steps are fundamental for enabling the interaction between our AI models and the database.
Obtain your OpenAI API key here: OpenAI API Key
Set up your SingleStore account and workspace: SingleStore Setup Guide
Connect to your SingleStore workspace: SingleStore Connection Documentation
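The next cell reads these credentials from environment variables. If they aren't already set in your environment, you can define them in the notebook first; the values below are placeholders, so substitute your own before running:

import os

# Placeholder credentials: replace with your own values before running.
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["SS_HOST"] = "<your-workspace-host>"
os.environ["SS_USERNAME"] = "<your-username>"
os.environ["SS_PASSWORD"] = "<your-password>"
os.environ["SS_DATABASE"] = "<your-database>"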
In [3]:
# OpenAI API Key
openai.api_key = os.environ["OPENAI_API_KEY"]

# SingleStore DB Connection
host = os.environ["SS_HOST"]
port = 3306
username = os.environ["SS_USERNAME"]
password = os.environ["SS_PASSWORD"]
database = os.environ["SS_DATABASE"]
Unstructured PDF Partition
The PDF Partition step is critical for ingesting and processing the PDF document. Here, we define the filename of the PDF to be processed. We then use the partition_pdf function to segment the PDF document, extracting various elements such as text, images, and tables. The function can execute locally or call a remote inference server, depending on your setup.
Additionally, the chunk_by_title function organizes the document into sections based on the presence of titles, with non-text elements treated as separate sections. The "fast" strategy is applied for quick text extraction, which is suitable for text-heavy PDFs.
In [4]:
pdf_filename = "Employee-Handbook.pdf"
In [5]:
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

elements = partition_pdf(pdf_filename, strategy="fast")
chunks = chunk_by_title(elements)
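If the default sectioning comes out too coarse or too fine for your document, chunk_by_title accepts tuning parameters; the names below exist in recent unstructured releases, but verify them against your installed version's signature:

# Optional: tune chunk sizes (verify parameter names for your unstructured version)
chunks = chunk_by_title(
    elements,
    combine_text_under_n_chars=250,  # fold very small sections into a neighbor
    max_characters=1500,             # hard cap on the length of any one chunk
)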
Reformat JSON Output into Structured DataFrame Format
After processing the PDF, we receive output in an unstructured JSON format, which includes valuable metadata about the extracted elements. This metadata enables us to filter and manipulate the document elements based on our requirements. Our next step is to convert this JSON output into a structured DataFrame, which is a more suitable format for storing in the SingleStore DB and for further processing in our RAG system.
Reference for understanding metadata: Unstructured Metadata Documentation
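To see exactly which metadata fields a chunk carries before flattening it into rows, each element exposes its metadata as a dictionary. A quick inspection sketch, assuming chunks from the partition step is non-empty:

# Inspect the first chunk's type and metadata fields
print(type(chunks[0]).__name__)
print(chunks[0].metadata.to_dict())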
In [6]:
# Convert JSON output into Pandas DataFrame
data = []
for c in chunks:
    row = {}
    row['Element Type'] = type(c).__name__
    row['Filename'] = c.metadata.filename
    row['Date Modified'] = c.metadata.last_modified
    row['Filetype'] = c.metadata.filetype
    row['Page Number'] = c.metadata.page_number
    row['text'] = c.text
    data.append(row)

df = pd.DataFrame(data)

# Show the DataFrame
df.head()
Make Connection to SingleStore Database
In this step, we establish a connection to the SingleStore Database using the MySQL connector. This connection is vital for creating a new table that matches the structure of our DataFrame and for uploading our data. SingleStoreDB Cloud's compatibility with MySQL allows us to leverage its tools for managing data and executing data-related tasks efficiently.
In [7]:
# Create connection to S2 Database
cnx = mysql.connector.connect(
    user=username,
    password=password,
    host=host,
    database=database,
)
cnx
In [8]:
# Drop the existing table
drop_cursor = cnx.cursor()
drop_query = "DROP TABLE IF EXISTS unstructured_data;"
drop_cursor.execute(drop_query)

# Create a new table
create_cursor = cnx.cursor()
create_query = (
    "CREATE TABLE unstructured_data ("
    "element_id INT AUTO_INCREMENT PRIMARY KEY, "
    "element_type VARCHAR(255), "
    "filename VARCHAR(255), "
    "date_modified DATETIME, "
    "filetype VARCHAR(255), "
    "page_number INT, "
    "text TEXT);"
)
create_cursor.execute(create_query)
cnx.commit()

drop_cursor.close()
create_cursor.close()
In [9]:
cursor = cnx.cursor()

# Loop through the DataFrame and insert each row into the table
for i, row in df.iterrows():
    insert_query = """
    INSERT INTO unstructured_data (element_type, filename, date_modified, filetype, page_number, text)
    VALUES (%s, %s, %s, %s, %s, %s);
    """
    cursor.execute(insert_query, (
        row['Element Type'],
        row['Filename'],
        row['Date Modified'],
        row['Filetype'],
        row['Page Number'],
        row['text'],
    ))

cnx.commit()
cursor.close()
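Row-by-row execute calls are fine for a short handbook. For larger documents, mysql.connector's executemany sends the same statement with a batch of parameter tuples, a minor variation on the cell above:

# Batched alternative to the row-by-row loop, using executemany
cursor = cnx.cursor()
insert_query = """
INSERT INTO unstructured_data (element_type, filename, date_modified, filetype, page_number, text)
VALUES (%s, %s, %s, %s, %s, %s);
"""
params = [
    (r['Element Type'], r['Filename'], r['Date Modified'], r['Filetype'], r['Page Number'], r['text'])
    for _, r in df.iterrows()
]
cursor.executemany(insert_query, params)
cnx.commit()
cursor.close()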
Create Text Embedding in the Table
Next, we enhance our database table by adding a new column for text embeddings. Using OpenAI's get_embedding function, we generate embeddings that measure the relatedness of text strings. These embeddings are particularly useful for search functionality, allowing us to rank results by relevance.
Reference: Understanding Text Embeddings
In [10]:
cursor = cnx.cursor(buffered=True)

# Add a new column for text embedding
alter_query = "ALTER TABLE unstructured_data ADD text_embedding TEXT;"
cursor.execute(alter_query)
In [11]:
# Select and embed all text in table
query = "SELECT text FROM unstructured_data;"
cursor.execute(query)
rows = cursor.fetchall()

for i in rows:
    text_embedding = json.dumps(get_embedding(i[0], engine="text-embedding-ada-002"))
    update_query = "UPDATE unstructured_data SET text_embedding = %s WHERE text = %s;"
    data = (text_embedding, i[0])
    cursor.execute(update_query, data)

cnx.commit()
cursor.close()
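Calling get_embedding once per row means one API request per chunk. The embeddings endpoint also accepts a list of inputs, so a batched variant can reduce request overhead. A sketch against the pre-1.0 openai client used in this notebook, subject to the API's batch-size limits:

# Batched embedding: one API call for many texts (pre-1.0 openai client)
texts = [r[0] for r in rows]
response = openai.Embedding.create(input=texts, engine="text-embedding-ada-002")
embeddings = [item["embedding"] for item in response["data"]]

cursor = cnx.cursor(buffered=True)
for text, emb in zip(texts, embeddings):
    cursor.execute(
        "UPDATE unstructured_data SET text_embedding = %s WHERE text = %s;",
        (json.dumps(emb), text),
    )
cnx.commit()
cursor.close()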
Run User Query Based on Similarity Score
The retrieval process begins by selecting the table and text embeddings from our database. We then calculate similarity scores using numpy's dot product function, comparing the user query embeddings with the document embeddings. This allows us to identify and select the top-5 most similar entries, which are most relevant to the user's query.
Reference: How the Dot Product Measures Similarity
In [12]:
# User query
search_string = "What do the emergency management provisions include?"
search_embedding = get_embedding(search_string, engine="text-embedding-ada-002")
search_embedding_array = np.array(search_embedding)
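One detail worth noting: OpenAI documents text-embedding-ada-002 vectors as normalized to length 1, so the dot product computed in the next cell is equivalent to cosine similarity. A quick sanity check:

# ada-002 embeddings are unit length, so dot product == cosine similarity
print(np.linalg.norm(search_embedding_array))  # expect a value very close to 1.0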
In [13]:
cursor = cnx.cursor()

# Fetch text, type, filename, and embeddings from the unstructured_data table
query = "SELECT text, element_type, filename, text_embedding FROM unstructured_data;"
cursor.execute(query)
results = cursor.fetchall()

# Compute dot product scores
scores = []
for res in results:
    text = res[0]
    type_ = res[1]
    filename = res[2]
    embedding_str = res[3]
    if embedding_str is not None:
        embedding = json.loads(embedding_str)
        embedding_array = np.array(embedding)
        # Compute dot product for all records
        score = np.dot(search_embedding_array, embedding_array)
        scores.append((text, type_, filename, score))

# Sort by score and take the top 5
top_5 = sorted(scores, key=lambda x: x[3], reverse=True)[:5]

# Close the connection
cursor.close()
cnx.close()

# Display top-k records
top_5
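Fetching every row and scoring it in Python works at this scale, but SingleStore can also compute the similarity inside SQL, avoiding shipping all embeddings to the client. A sketch using SingleStore's JSON_ARRAY_PACK and DOT_PRODUCT functions (run it while cnx is still open, and check your SingleStore version's vector documentation for the exact signatures):

# Alternative: score inside SingleStore instead of in Python.
# JSON_ARRAY_PACK converts the JSON array string into a packed vector blob;
# DOT_PRODUCT scores it against the query embedding.
cursor = cnx.cursor()
sql = """
SELECT text, element_type, filename,
       DOT_PRODUCT(JSON_ARRAY_PACK(text_embedding), JSON_ARRAY_PACK(%s)) AS score
FROM unstructured_data
WHERE text_embedding IS NOT NULL
ORDER BY score DESC
LIMIT 5;
"""
cursor.execute(sql, (json.dumps(search_embedding),))
top_5_sql = cursor.fetchall()
cursor.close()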
Generate the Answer via OpenAI ChatCompletion
In the final step, we take the top-5 most similar entries retrieved from the database and use them as input for OpenAI's ChatCompletion. The ChatCompletion model is designed for both multi-turn conversations and single-turn tasks. It takes a list of messages as input and returns a model-generated message as output, providing us with a coherent and contextually relevant response based on the retrieved documents.
Reference: OpenAI Chat Completions API Guide
In [14]:
if top_5:
    try:
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[
                {
                    "role": "system",
                    "content": "You are a useful assistant. Use the assistant's content to answer the user's query. Summarize your answer based on the context.",
                },
                {"role": "assistant", "content": str(top_5)},
                {"role": "user", "content": search_string},
            ],
            temperature=0,
        )
        assistant_message = response['choices'][0]['message']['content']
        print("Assistant's Response:", assistant_message)
    except Exception as e:
        print(f"OpenAI API call failed: {e}")
else:
    print("No relevant documents found.")
About this Template
Ask questions of your unstructured PDFs. In this notebook, Unstructured.io ingests PDFs accurately, OpenAI is used to create embeddings, the vector data is stored in SingleStore, and finally you ask questions of your PDF data.
This Notebook can be run in Standard and Enterprise deployments.
License
This Notebook has been released under the Apache 2.0 open source license.