Ask questions of your PDFs with PDFPlumber

SingleStore Notebooks

Install PDFPlumber Library

We'll start by installing the PDFPlumber library, which is essential for ingesting and processing PDF files. The library will allow us to convert PDF documents into a JSON format that includes both metadata and text extraction. For this part of the project, we'll focus on installing the PDF support components.

Reference for full installation details: PDFPlumber Installation Guide

In [1]:

!pip install pdfplumber==0.11.0

Import Libraries

In this section, we import the necessary libraries for our project. We'll use pandas to handle data manipulation, converting our semi-structured JSON data into a structured DataFrame format. This is crucial for storing the data in the SingleStore database later on. Additionally, we'll utilize the OpenAI API for vectorizing text and generating responses, integral components of our RAG system.
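As a minimal illustration of that conversion (using made-up records, not this notebook's actual PDF output), pandas can turn a list of JSON-style dictionaries directly into a DataFrame:

```python
import json
import pandas as pd

# Hypothetical semi-structured records, similar in shape to what the
# PDF extraction step later in this notebook produces
records_json = (
    '[{"title": "SECTION 1", "page": 1, "text": "Welcome."},'
    ' {"title": "SECTION 2", "page": 3, "text": "Policies."}]'
)
records = json.loads(records_json)

demo_df = pd.DataFrame(records)
print(demo_df.shape)           # (2, 3)
print(list(demo_df.columns))   # ['title', 'page', 'text']
```

Each dictionary becomes a row and each key becomes a column, which is exactly the shape we need for a SQL table later.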

In [2]:

!pip install "openai==0.28.1"

In [3]:

import os
import json
import pandas as pd
import numpy as np
import singlestoredb as s2
import openai

Configure OpenAI API and SingleStore Database

Before we proceed, it's important to configure our environment. This involves setting up access to the OpenAI API and the SingleStore cloud database. You'll need to retrieve your OpenAI API key and establish a connection with the SingleStore database. These steps are fundamental for enabling the interaction between our AI models and the database.

In [4]:

import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

In [5]:

try:
    s2_conn = s2.connect()
    s2_conn.autocommit(True)
    s2_cur = s2_conn.cursor()
    print("SingleStore connection successful!")
except Exception as e:
    raise RuntimeError(f"SingleStore connection failed: {e}")

PDF Extraction & Chunking (pdfplumber)

We use a pdfplumber-based extraction flow (pdfplumber is a lightweight, standalone PDF text extraction library). This approach:

  • Opens the PDF and extracts raw text per page.

  • Applies a simple heading regex to split pages into logical sections (chunks) based on all-uppercase or structured headings (e.g., SECTION 1, Chapter 2, POLICY GUIDELINES).

  • Produces a list of chunk dictionaries you can load into a DataFrame and embed.
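The heading-detection idea can be sketched on a few sample lines. This uses the same style of regex as the extraction cell below; the sample lines themselves are invented:

```python
import re

# All-uppercase lines, or "Section N" / "Chapter N", start a new chunk
heading_pattern = re.compile(r"^(?:[A-Z][A-Z0-9 \-/]{3,}|Section\s+\d+|Chapter\s+\d+)$")

sample_lines = [
    "POLICY GUIDELINES",            # all-caps heading -> new chunk
    "Section 2",                    # structured heading -> new chunk
    "This is ordinary body text.",  # lowercase body -> appended to current chunk
    "Chapter 10",                   # structured heading -> new chunk
]
is_heading = [bool(heading_pattern.match(line)) for line in sample_lines]
print(is_heading)  # [True, True, False, True]
```

Anchoring with `^...$` keeps mixed-case body sentences from being misclassified as headings.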

References:

  • pdfplumber: https://github.com/jsvine/pdfplumber

  • PyMuPDF (optional alternative): https://pymupdf.readthedocs.io/en/latest/

Uploading PDF File to Stage

Upload the PDF to the Stage folder (Deployments tab) for the chosen workspace group before ingesting its contents.

In [6]:

%%sql
DOWNLOAD STAGE FILE 'Employee-Handbook.pdf' TO 'Employee-Handbook.pdf' OVERWRITE

In [7]:

pdf_filename = "Employee-Handbook.pdf"

In [8]:

import pdfplumber, re

# Extract raw text per page
pages = []
try:
    with pdfplumber.open(pdf_filename) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            pages.append({"page_number": i + 1, "text": text})
    print(f"Loaded {len(pages)} pages.")
except Exception as e:
    raise RuntimeError(f"pdfplumber failed to read PDF: {e}")

# Heading regex: all-uppercase lines, or "Section N" / "Chapter N"
heading_re = re.compile(r"^(?:[A-Z][A-Z0-9 \-/]{3,}|Section\s+\d+|Chapter\s+\d+)$")
chunks = []
current_title = None
current_body = []
current_page_start = None
last_page_num = None

for page in pages:
    for line in page["text"].splitlines():
        line_stripped = line.strip()
        if heading_re.match(line_stripped) and len(line_stripped.split()) <= 15:
            # Flush the previous chunk before starting a new one
            if current_body:
                chunks.append({
                    "title": current_title,
                    "body": "\n".join(current_body),
                    "page_start": current_page_start,
                    "page_end": last_page_num
                })
                current_body = []
            current_title = line_stripped
            current_page_start = page["page_number"]
        else:
            if line_stripped:
                current_body.append(line_stripped)
                last_page_num = page["page_number"]

# Flush the last chunk
if current_body:
    chunks.append({
        "title": current_title,
        "body": "\n".join(current_body),
        "page_start": current_page_start,
        "page_end": last_page_num
    })

print(f"Chunking produced {len(chunks)} chunks.")

Reformat Chunk Output into a Structured DataFrame

After processing the PDF, we have a list of semi-structured chunk dictionaries, each carrying metadata (section title and page range) alongside the extracted text. This metadata lets us filter and manipulate the document elements based on our requirements. Our next step is to convert this output into a structured DataFrame, which is a more suitable format for storing in SingleStoreDB and for further processing in our RAG system.

In [9]:

# Convert chunk dictionaries into a pandas DataFrame
import pandas as pd

data = []
for c in chunks:
    row = {}
    row['Element Type'] = 'Chunk'
    row['Filename'] = pdf_filename
    row['Date Modified'] = None  # Not available via pdfplumber
    row['Filetype'] = 'pdf'
    # Use start page (could also store range)
    row['Page Number'] = c.get('page_start')
    # Combine title + body
    if c.get('title'):
        row['text'] = f"{c.get('title')}\n{c.get('body')}"
    else:
        row['text'] = c.get('body')
    data.append(row)

df = pd.DataFrame(data)
print(f"DataFrame rows: {len(df)}")
df.head()

Make Connection to SingleStore Database

In this step, we establish a connection to the SingleStore Database. This connection is vital for creating a new table that matches the structure of our DataFrame and for uploading our data. SingleStoreDB Cloud's compatibility with MySQL allows us to leverage its tools for managing data and executing data-related tasks efficiently.

In [10]:

s2_cur.execute("DROP TABLE IF EXISTS unstructured_data;")
create_query = (
    "CREATE TABLE unstructured_data ("
    "element_id INT AUTO_INCREMENT PRIMARY KEY, "
    "element_type VARCHAR(255), "
    "filename VARCHAR(255), "
    "date_modified DATETIME, "
    "filetype VARCHAR(255), "
    "page_number INT, "
    "text TEXT)"
)
s2_cur.execute(create_query)
print("Table unstructured_data ready.")

In [11]:

insert_query = (
    "INSERT INTO unstructured_data (element_type, filename, date_modified, filetype, page_number, text) "
    "VALUES (%s, %s, %s, %s, %s, %s);"
)
for i, row in df.iterrows():
    s2_cur.execute(insert_query, (
        row['Element Type'], row['Filename'], row['Date Modified'], row['Filetype'], row['Page Number'], row['text']
    ))
print(f"Inserted {len(df)} rows into unstructured_data.")

Create Text Embedding in the Table

Next, we enhance our database table by adding a new column for text embeddings. Using the OpenAI embeddings API, we generate embeddings that measure the relatedness of text strings. These embeddings are particularly useful for search, allowing us to rank results by relevance.

Reference: Understanding Text Embeddings
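Because the embedding column is plain TEXT, each vector is stored as a JSON string and parsed back on read. A quick round-trip sketch of that storage convention, using a toy three-element vector rather than a real embedding:

```python
import json
import numpy as np

vector = [0.12, -0.05, 0.33]   # stand-in for a real embedding
stored = json.dumps(vector)    # the string that goes into the TEXT column
restored = np.array(json.loads(stored), dtype=np.float32)

print(restored.shape)                  # (3,)
print(np.allclose(restored, vector))   # True
```

JSON serialization keeps the column human-readable at the cost of some storage overhead compared with a binary format.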

In [12]:

s2_cur.execute("ALTER TABLE unstructured_data ADD COLUMN text_embedding TEXT;")
print("Added text_embedding column.")

In [13]:

import os, time, json

DEFAULT_EMBED_MODEL = "text-embedding-3-small"
DEFAULT_CHAT_MODEL = "gpt-4o-mini"

_openai_client = None
_use_new = False

def _ensure_key():
    key = os.getenv("OPENAI_API_KEY")
    if key and not getattr(openai, 'api_key', None):
        openai.api_key = key.strip()
    if not getattr(openai, 'api_key', None):
        raise ValueError("OpenAI API key not set. Set OPENAI_API_KEY env or run the key input cell.")

def _init_client():
    global _openai_client, _use_new
    if _openai_client is not None:
        return
    try:
        # New-style SDK (openai>=1.0) if available
        from openai import OpenAI
        _openai_client = OpenAI(api_key=openai.api_key)
        _use_new = True
    except Exception:
        # Fall back to the legacy 0.x module-level API
        _openai_client = None
        _use_new = False

def embed_texts(text_list, model=DEFAULT_EMBED_MODEL):
    _ensure_key(); _init_client()
    if _use_new and _openai_client is not None:
        resp = _openai_client.embeddings.create(model=model, input=text_list)
        return [d.embedding for d in resp.data]
    else:
        resp = openai.Embedding.create(model=model, input=text_list)
        return [d['embedding'] for d in resp['data']]

def embed_text(text, model=DEFAULT_EMBED_MODEL):
    return embed_texts([text], model=model)[0]

def chat_completion(messages, model=DEFAULT_CHAT_MODEL, temperature=0):
    _ensure_key(); _init_client()
    if _use_new and _openai_client is not None:
        resp = _openai_client.chat.completions.create(model=model, messages=messages, temperature=temperature)
        return resp.choices[0].message.content
    else:
        resp = openai.ChatCompletion.create(model=model, messages=messages, temperature=temperature)
        return resp['choices'][0]['message']['content']

In [14]:

import json, time

EMBED_MODEL = "text-embedding-3-small"
BATCH_SIZE = 10
MAX_RETRIES = 3

# Fetch rows still missing embeddings
s2_cur.execute("SELECT element_id, text FROM unstructured_data WHERE text_embedding IS NULL OR text_embedding = '';")
rows = s2_cur.fetchall()
print(f"Rows needing embeddings: {len(rows)}")

for i in range(0, len(rows), BATCH_SIZE):
    batch = rows[i:i+BATCH_SIZE]
    texts = [t for _, t in batch]
    attempt = 0
    while True:
        try:
            embeddings = embed_texts(texts, model=EMBED_MODEL)
            break
        except Exception as e:
            attempt += 1
            if attempt >= MAX_RETRIES:
                print(f"Failed batch starting at index {i}: {e}")
                embeddings = [None] * len(batch)
                break
            sleep_time = 2 ** attempt
            print(f"Retry {attempt} for batch starting at {i} after error: {e}. Sleeping {sleep_time}s")
            time.sleep(sleep_time)
    for (element_id, _), emb in zip(batch, embeddings):
        if emb:
            s2_cur.execute("UPDATE unstructured_data SET text_embedding = %s WHERE element_id = %s;", (json.dumps(emb), element_id))

print("Embedding update complete.")

Run User Query Based on Similarity Score

The retrieval process begins by selecting the stored text and embeddings from our database table. We then calculate similarity scores using numpy's dot product, comparing the user query embedding against each document embedding. This allows us to identify and select the top-5 most similar entries, which are most relevant to the user's query.

Reference: How the Dot Product Measures Similarity
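The ranking step in the next cells boils down to one dot product per stored vector. On toy two-dimensional vectors (invented purely for illustration), the idea looks like this:

```python
import numpy as np

toy_query = np.array([1.0, 0.0], dtype=np.float32)
toy_docs = {
    "doc_a": np.array([0.9, 0.1], dtype=np.float32),  # points roughly the same way
    "doc_b": np.array([0.1, 0.9], dtype=np.float32),  # points a different way
}

# Higher dot product = more similar direction; for unit-length embeddings
# this equals cosine similarity
toy_scores = {name: float(np.dot(toy_query, vec)) for name, vec in toy_docs.items()}
best_doc = max(toy_scores, key=toy_scores.get)
print(best_doc)  # doc_a
```

OpenAI embeddings are normalized to unit length, which is why a plain dot product is a reasonable stand-in for cosine similarity here.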

In [15]:

search_string = "What do the emergency management provisions include?"
search_embedding = embed_text(search_string)
search_embedding_array = np.asarray(search_embedding, dtype=np.float32)

In [16]:

# Fetch text, type, filename, and embeddings from the unstructured_data table
s2_cur.execute("SELECT text, element_type, filename, text_embedding FROM unstructured_data WHERE text_embedding IS NOT NULL;")
results = s2_cur.fetchall()

scores = []
for text, type_, filename, embedding_str in results:
    if embedding_str:
        embedding = json.loads(embedding_str)
        embedding_array = np.array(embedding, dtype=np.float32)
        score = np.dot(search_embedding_array, embedding_array)
        scores.append((text, type_, filename, score))

# Sort by score and take the top 5
top_5 = sorted(scores, key=lambda x: x[3], reverse=True)[:5]

# Display top-k records
top_5

Generate the Answer via OpenAI ChatCompletion

In the final step, we take the top-5 most similar entries retrieved from the database and use them as input for OpenAI's ChatCompletion. The ChatCompletion model is designed for both multi-turn conversations and single-turn tasks. It takes a list of messages as input and returns a model-generated message as output, providing us with a coherent and contextually relevant response based on the retrieved documents.

Reference: OpenAI Chat Completions API Guide

In [17]:

if top_5:
    try:
        # Use the chat_completion helper defined above, which works with
        # both the legacy and the current OpenAI SDKs
        assistant_message = chat_completion(
            messages=[
                {"role": "system",
                 "content": "You are a helpful assistant. Use the provided context to answer the user's query. Summarize your answer based on the context."},
                {"role": "assistant", "content": str(top_5)},
                {"role": "user", "content": search_string},
            ],
            temperature=0,
        )
        print("Assistant's Response:", assistant_message)
    except Exception as e:
        print(f"OpenAI API call failed: {e}")
else:
    print("No relevant documents found.")

Details


About this Template

Ask questions of your unstructured PDFs. In this notebook, PDFPlumber ingests PDFs, OpenAI creates embeddings, the vector data is stored in SingleStore, and finally you ask questions of your PDF data.

This Notebook can be run in Standard and Enterprise deployments.

Tags

ingest, pdf, vector, pdfplumber

License

This Notebook has been released under the Apache 2.0 open source license.

See Notebook in action

Launch this notebook in SingleStore and start executing queries instantly.