Resume Evaluator with Vector Index
Note
This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.
In this notebook, we will showcase a practical use-case of evaluating resumes leveraging the combined powers of SingleStoreDB and OpenAI. Moving beyond traditional resume-matching techniques, we introduce a method that dives deeper into the nuances of resume content and the specific needs of a job description.
With SingleStoreDB's innate ability to swiftly manage and query data, we begin by ingesting a resume PDF, extracting its textual content, and storing it in a dedicated database table. But we don’t stop there! The resume's text is then processed by OpenAI's LLM, summarizing its key points. This summarized version, along with its vector representation, is saved back to our SingleStoreDB table.
When it's time to match a resume to a job description, we translate the latter into its vector form. Using a dot_product operation, we search against the table housing the resume summaries' embeddings. This provides us with resumes most aligned with the job description. To add another layer of precision, the matched resumes are further evaluated alongside the job description using OpenAI's LLM, offering a comprehensive assessment.
Join us on this journey to redefine resume evaluations using SQL, SingleStoreDB, and OpenAI for better, more insightful matches!
1. Create a workspace in your workspace group
S-00 is sufficient.
2. Create a database named resume_evaluator
If you want to use the free tier, follow the steps below.
Action Required
If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
In [1]:
shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS resume_evaluator;
    %sql CREATE DATABASE resume_evaluator;
Action Required
Make sure to select a database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
Note
If you want to use a dedicated workspace, select the workspace and database from the drop-down at the top, then follow the steps below to create the database.
In [2]:
%%sql
DROP DATABASE IF EXISTS resume_evaluator;
CREATE DATABASE resume_evaluator;
USE resume_evaluator;
In [3]:
%%sql
CREATE TABLE IF NOT EXISTS resume_evaluator.resumes_profile_data (
    names text,
    email text,
    phone_no text,
    years_of_experience text,
    skills text,
    profile_name text,
    resume_summary text,
    resume_embeddings vector(1536, F32)
);
3. Install and import required libraries
In this section, we will set up the environment by installing a few crucial libraries. We use pdfminer.six to extract text from resume PDFs and openai to interact with OpenAI's LLM. We also install boto3 to list the sample resume PDFs stored in S3, and sqlalchemy-singlestoredb to connect to SingleStoreDB.
The install process may take a couple of minutes.
In [4]:
!pip install -q pdfminer.six openai boto3 sqlalchemy-singlestoredb
In [5]:
import os
import requests
import re
import json
import getpass

from pdfminer.high_level import extract_text
import openai
from openai import OpenAI
import pandas as pd
import numpy as np
from sqlalchemy import text, create_engine
4. Create a function called get_embedding()
In our workflow, we need a consistent way to transform textual content into vector embeddings. To achieve this, we introduce the get_embedding() function.
This function takes in a piece of text and, by default, uses the "text-embedding-ada-002" model to produce embeddings. We ensure that any newline characters in the text are replaced with spaces to maintain the integrity of the input. The function then leverages OpenAI's API to generate and retrieve the embedding for the given text.
In [6]:
api_key = getpass.getpass('OpenAI API Key: ')
openai.api_key = api_key
client = OpenAI(api_key=api_key)
In [7]:
def get_embedding(text, model="text-embedding-ada-002"):
    text = text.replace("\n", " ")
    response = openai.embeddings.create(input=[text], model=model)
    return response.data[0].embedding
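As a quick sanity check (a minimal sketch; the sample string below is just an illustration), you can embed a short piece of text and confirm the vector has 1536 dimensions:

# Sanity check: embed a short, hypothetical sample string
sample_vec = get_embedding("Senior data engineer with Python, SQL, and Spark experience")
print(len(sample_vec))  # expected: 1536 for text-embedding-ada-002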
5. Create a function called print_pdf_text()
This function is designed to extract and clean the text from a provided PDF, either from a web URL or a local file path.
Input:
url: Web URL of the PDF (optional).
file_path: Local path of the PDF (optional).
Functionality:
Source Determination: Fetches the PDF from url or uses the local file_path; raises an error if neither is provided.
Text Extraction: Extracts text from the PDF using pdfminer.
Text Cleaning: Removes special characters, retaining only "@", "+", ".", "/", ":", and ",", and improves formatting by handling newline characters.
Cleanup: If a temporary file was created from a URL, it gets deleted post-processing.
Output:
Returns the cleaned and formatted text from the PDF.
In [8]:
def print_pdf_text(url=None, file_path=None):
    # Determine the source of the PDF (URL or local file)
    if url:
        response = requests.get(url)
        response.raise_for_status()  # Ensure the request was successful
        temp_file_path = "temp_pdf_file.pdf"
        with open(temp_file_path, 'wb') as temp_file:
            temp_file.write(response.content)  # Save the PDF to a temporary file
        pdf_source = temp_file_path
    elif file_path:
        pdf_source = file_path  # Use the provided local file path
    else:
        raise ValueError("Either url or file_path must be provided.")

    # Extract text using pdfminer
    text = extract_text(pdf_source)

    # Remove special characters except "@", "+", ".", "/", ":" and ","
    cleaned_text = re.sub(r"[^a-zA-Z0-9\s@+./:,]", "", text)

    # Format the text for better readability
    cleaned_text = cleaned_text.replace("\n\n", " ").replace("\n", " ")

    # If a temporary file was used, delete it
    if url and os.path.exists(temp_file_path):
        os.remove(temp_file_path)

    return cleaned_text
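As a usage sketch (the URL and path below are placeholders, not files shipped with this notebook), the function accepts either source:

# Extract text from a hosted PDF (placeholder URL) ...
# sample_text = print_pdf_text(url="https://example.com/sample_resume.pdf")

# ... or from a local file (placeholder path)
# sample_text = print_pdf_text(file_path="/path/to/resume.pdf")
# print(sample_text[:500])  # preview the first 500 characters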
6. Create a function called pinfo_extractor()
This function is tailored to extract specific details from a candidate's resume text.
Input:
resume_text: The text extracted from a candidate's resume.
Functionality:
Prompt Creation: A context is formed using the provided resume text, and a detailed question prompt is generated to guide the extraction of the desired details.
OpenAI API Interaction: Uses the gpt-3.5-turbo model to process the prompt, then extracts the relevant sections such as Name, Email, Phone Number, and more from the generated response.
Data Structuring: The extracted details are organized into a dictionary.
Output:
Returns a dictionary with keys such as 'name', 'email', 'phone_no', and more, containing the information extracted from the resume.
In [9]:
def pinfo_extractor(resume_text):
    context = f"Resume text: {resume_text}"
    question = """From the above candidate's resume text, extract only the following details:
    Name: (Find the candidate's full name. If not available, specify "not available.")
    Email: (Locate the candidate's email address. If not available, specify "not available.")
    Phone Number: (Identify the candidate's phone number. If not found, specify "not available.")
    Years of Experience: (If not explicitly mentioned, calculate the years of experience by analyzing the time durations at each company or position listed. Sum up the total durations to estimate the years of experience. If not determinable, write "not available.")
    Skills Set: Extract the skills which are purely technical and represent them as: [skill1, skill2,... <other skills from resume>]. If no skills are provided, state "not available."
    Profile: (Identify the candidate's job profile or designation. If not mentioned, specify "not available.")
    Summary: Provide a brief summary of the candidate's profile without using more than one newline to segregate sections.
    """
    prompt = f"""
    Based on the below given candidate information, only answer the asked question:
    {context}
    Question: {question}
    """

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful HR recruiter."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=700,
        temperature=0.5,
        n=1  # one generation per document
    )

    # Extract the generated response
    response_text = response.choices[0].message.content

    # Split the response into lines and pull the value after each label
    lines = response_text.strip().split('\n')
    name = lines[0].split(': ')[1]
    email = lines[1].split(': ')[1]
    phone_no = lines[2].split(': ')[1]
    years_of_experience = lines[3].split(': ')[1]
    skills = lines[4].split(': ')[1]
    profile = lines[5].split(': ')[1]
    summary = lines[6].split(': ')[1]

    data_dict = {
        'name': name,
        'email': email,
        'phone_no': phone_no,
        'years_of_experience': years_of_experience,
        'skills': skills,
        'profile': profile,
        'summary': summary
    }
    return data_dict
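The line-by-line parsing above assumes the model always returns all seven labels in order. If you want a slightly more defensive variant, a label-matching helper such as this sketch (not part of the original notebook) can be swapped in for the split logic:

def parse_labeled_response(response_text):
    # Map the known labels to dictionary keys, tolerating missing lines or extra whitespace
    labels = {
        'Name': 'name',
        'Email': 'email',
        'Phone Number': 'phone_no',
        'Years of Experience': 'years_of_experience',
        'Skills Set': 'skills',
        'Profile': 'profile',
        'Summary': 'summary',
    }
    data = {key: 'not available' for key in labels.values()}
    for line in response_text.strip().split('\n'):
        label, sep, value = line.partition(':')
        key = labels.get(label.strip())
        if sep and key:
            data[key] = value.strip()
    return data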
7. Create a function called add_data_to_db()
This function is designed to add extracted resume details into a database.
Input:
input_dict
: Dictionary containing details like 'name', 'email', 'phone_no', and more extracted from a resume.
Functionality:
Database Connection:
Establishes a connection to the database using SQLAlchemy's
create_engine
with the given connection URL.
Embedding Creation:
Calls the
get_embedding()
function to generate an embedding for the resume summary.
SQL Query Formation:
Crafts an SQL query to insert the provided data (from the input dictionary) into the
resumes_profile_data
table in the database.
Data Insertion:
Opens a connection, executes the SQL query, commits the changes, and then closes the connection.
Output:
Prints a confirmation message upon successful data insertion.
In [10]:
def add_data_to_db(input_dict):
    # Create the SQLAlchemy engine using the notebook's connection_url
    engine = create_engine(connection_url)

    # Get the embedding for the summary text
    summary = input_dict['summary']
    embedding = get_embedding(summary)

    # Create the SQL query for inserting the data
    query_sql = f"""
        INSERT INTO resumes_profile_data
            (names, email, phone_no, years_of_experience, skills, profile_name, resume_summary, resume_embeddings)
        VALUES
            ("{input_dict['name']}", "{input_dict['email']}", "{input_dict['phone_no']}", "{input_dict['years_of_experience']}",
             "{input_dict['skills']}", "{input_dict['profile']}", "{summary}", '{embedding}');
    """

    with engine.connect() as connection:
        connection.execute(text(query_sql))
        connection.commit()
    print("Data written to the resumes_profile_data table.")
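Because resume text can contain quotes, a variant worth considering is binding the values as query parameters instead of interpolating them into the SQL string. A minimal sketch of that alternative (same table and columns; the helper name is ours, not part of the original notebook):

def add_data_to_db_parameterized(input_dict):
    # Same insert as above, but with bound parameters instead of string formatting
    engine = create_engine(connection_url)
    embedding = get_embedding(input_dict['summary'])
    query_sql = text("""
        INSERT INTO resumes_profile_data
            (names, email, phone_no, years_of_experience, skills, profile_name, resume_summary, resume_embeddings)
        VALUES (:names, :email, :phone_no, :yoe, :skills, :profile, :summary, :embedding)
    """)
    with engine.connect() as connection:
        connection.execute(query_sql, {
            'names': input_dict['name'],
            'email': input_dict['email'],
            'phone_no': input_dict['phone_no'],
            'yoe': input_dict['years_of_experience'],
            'skills': input_dict['skills'],
            'profile': input_dict['profile'],
            'summary': input_dict['summary'],
            'embedding': json.dumps(embedding),  # pass the embedding as a JSON array string, as in the original insert
        })
        connection.commit()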
8. Create a function called search_resumes()
This function facilitates the search for resumes that are most similar to a given query, leveraging embeddings and database operations.
Input:
query: A string that represents the job description or any other search criteria.
Functionality:
Embedding Creation: Converts the given query into its corresponding embedding using the get_embedding() function.
SQL Query Formation: Creates an SQL query to search for the top 5 resumes in the resumes_profile_data table with the highest similarity (dot product) to the query embedding.
Database Operations: Opens a connection to the database, runs the SQL query to fetch the results, and then closes the connection.
Output:
Returns a list of the top 5 most relevant resumes for the given query.
In [11]:
def search_resumes(query):
    query_embed = get_embedding(query)
    query_sql = f"""
        SELECT names, resume_summary,
               dot_product('{query_embed}' :> VECTOR(1536), resume_embeddings) AS similarity
        FROM resumes_profile_data
        ORDER BY similarity DESC
        LIMIT 5;
    """
    engine = create_engine(connection_url)
    connection = engine.connect()
    result = connection.execute(text(query_sql)).fetchall()
    connection.close()
    engine.dispose()
    return result
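To try it out, pass a short job description (the one below is just an example) and print each candidate's name and similarity score:

matches = search_resumes("Data engineer with strong Python, SQL, and ETL pipeline experience")  # example query
for names, resume_summary, similarity in matches:
    print(f"{names}: similarity = {similarity:.4f}")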
9. Create a function called evaluate_candidates()
This function is geared towards evaluating the compatibility of candidates' resumes in relation to a specific job description.
Input:
query
: A string that represents the job description against which candidates' resumes will be assessed.
Functionality:
Resume Retrieval:
Utilizes the
search_resumes()
function to get the top matching resumes based on the job description.
OpenAI API Interaction:
For each retrieved resume, a prompt is crafted, asking to evaluate how well the candidate fits the job description.
Interacts with the
gpt-3.5-turbo
model to process this prompt and receive an efficient, concise response.
Data Aggregation:
Collects the model's evaluation responses for each candidate in a list.
Output:
Returns a list of tuples, where each tuple contains:
Candidate's name.
Evaluation response from the model, describing the compatibility of the candidate with the given job description.
In [12]:
def evaluate_candidates(query):
    result = search_resumes(query)
    responses = []  # List to store responses for each candidate

    for resume_str in result:
        name = resume_str[0]
        context = f"Resume text: {resume_str[1]}"
        question = f"How well does the candidate meet the following job description? Is he/she fit for the given profile/roles and responsibilities? Answer in 3 lines only and be efficient while answering: {query}."
        prompt = f"""
        Read the candidate information below:
        {context}
        Question: {question}
        """
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are an expert HR analyst and recruiter."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=100,
            temperature=0.2,
            n=1  # one generation per document
        )
        # Extract the generated response
        response_text = response.choices[0].message.content
        responses.append((name, response_text))  # Append the name and evaluation text

    return responses
In [13]:
import boto3
from botocore import UNSIGNED
from botocore.config import Config

def list_s3_files(bucket_name, prefix):
    s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
    paginator = s3.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=prefix)

    urls = []
    for page in page_iterator:
        if "Contents" in page:
            for obj in page['Contents']:
                object_url = f"https://{bucket_name}.s3.amazonaws.com/{obj['Key']}"
                urls.append(object_url)
    return urls

bucket_name = 'real-time-anomaly-detection-demo'
path_prefix = 'resume-eval/'
urls = list_s3_files(bucket_name, path_prefix)
# urls
In [14]:
# You can use the urls list above; for demo purposes we use only the following
urls = [
    'https://real-time-anomaly-detection-demo.s3.amazonaws.com/resume-eval/16b67c12-62e9-4d81-8f9f-ef3112daee13.pdf',
    'https://real-time-anomaly-detection-demo.s3.amazonaws.com/resume-eval/1a394211-8074-4a88-a6ce-70c1d83f8a49.pdf',
    'https://real-time-anomaly-detection-demo.s3.amazonaws.com/resume-eval/1f7f185a-3629-45d8-a21a-7fd57ee541af.pdf',
    'https://real-time-anomaly-detection-demo.s3.amazonaws.com/resume-eval/21590a01-b5f7-4739-81b5-6aca293cc80a.pdf',
    'https://real-time-anomaly-detection-demo.s3.amazonaws.com/resume-eval/22f701ff-6402-463b-8b7c-d5c252607763.pdf'
]

for url in urls:
    resume_text = print_pdf_text(url=url).replace('\n', ' ')
    ip_data_dict = pinfo_extractor(resume_text)
    add_data_to_db(ip_data_dict)
In [15]:
%%sql
SELECT COUNT(*) FROM resumes_profile_data;
10. Using Vector Index in SingleStore
Vector indexes in SingleStore enhance query performance for high-dimensional data, commonly used in machine learning, search applications, and recommendation systems. Understanding when and why to use each type of vector index is crucial for optimizing your database's performance and storage efficiency.
Types of Vector Indexes
FLAT
When to Use: Small datasets or when exact nearest neighbors are required.
Why: Performs a full scan, ensuring the most accurate results but can be slower on large datasets.
HNSW (Hierarchical Navigable Small World)
When to Use: General-purpose, suitable for a broad range of datasets.
Why: Balances speed and accuracy well, efficient for both small and large datasets. Offers faster queries by constructing a multi-layered graph structure.
IVF (Inverted File Index)
When to Use: Large datasets where query speed is critical, and approximate results are acceptable.
Why: Divides the dataset into clusters to reduce the search area, significantly increasing query speed at the cost of slight accuracy loss.
Choosing the Right Index
Dataset Size: For smaller datasets, or when absolute accuracy is paramount, FLAT is suitable. For larger datasets, or when query performance is a priority, consider HNSW or IVF.
Query Speed vs. Accuracy: If your application requires the fastest possible queries and can tolerate approximate results, IVF is recommended. For a good balance between speed and accuracy, HNSW is often the best choice.
Application Requirements: Recommendation systems and content similarity searches often benefit from HNSW due to its efficiency and balance. Exact-match searches or smaller datasets may lean towards FLAT.
Selecting the right vector index depends on your specific application needs, the size of your dataset, and the balance you wish to strike between query speed and accuracy. Experimenting with different index types and configurations is key to finding the optimal setup for your use case.
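As a point of comparison, the same column could instead be indexed with HNSW by changing only the INDEX_OPTIONS. The statement below is a sketch of that alternative (assuming SingleStoreDB 8.5+ vector index syntax) and is left commented out so it does not conflict with the IVF_FLAT index created in the next cell:

%%sql
-- Alternative index type, for reference only:
-- ALTER TABLE resumes_profile_data
-- ADD VECTOR INDEX hnsw_flat (resume_embeddings) INDEX_OPTIONS '{"index_type":"HNSW_FLAT"}';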
In [16]:
%%sql
ALTER TABLE resumes_profile_data
ADD VECTOR INDEX ivf_flat (resume_embeddings) INDEX_OPTIONS '{"index_type":"IVF_FLAT"}';
In [17]:
evaluate_candidates(input("Enter Job description : \n"))
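evaluate_candidates() returns a list of (name, evaluation) tuples; a small loop like the following sketch (the job description is just an example) prints them in a readable form:

results = evaluate_candidates("Data engineer with strong Python and SQL skills")  # example description
for name, evaluation in results:
    print(f"Candidate: {name}\n{evaluation}\n")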
Clean up
Action Required
If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.
In [18]:
shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS resume_evaluator;
Details
About this Template
Resume Evaluator using vectors stored in SingleStore to find suitable resumes for your job description, with indexed ANN search.
This Notebook can be run in Shared Tier, Standard and Enterprise deployments.
Tags
License
This Notebook has been released under the Apache 2.0 open source license.