Using OpenAI and SingleStore for Automated Resume Scans + Assessments

In the dynamic field of human resources and recruitment, the influx of resumes and candidate data presents both opportunities and challenges. Harnessing this wealth of information for efficient talent acquisition requires innovative tools that can process and analyze data swiftly and accurately.

With the integration of advanced AI capabilities from OpenAI and SingleStore’s comprehensive data platform, organizations can transform the way they evaluate and select candidates, turning the tide of overwhelming data into actionable insights for strategic hiring decisions.

The challenges of manual bulk resume processing

In the fast-paced HR environment, manual bulk resume processing is time-consuming and often leads to inefficiencies. Recruiters face the daunting task of sifting through hundreds of resumes, trying to match them with job descriptions. This traditional approach is not only labor-intensive but also prone to human error, leading to potential oversight of qualified candidates.

Introducing SingleStore

For those who are unfamiliar, SingleStore (formerly known as MemSQL) is the world's only real-time data platform that can read, write and reason on petabyte-scale data in a few milliseconds. At its core is a high-performance SQL database engine that allows you to transact, analyze and contextualize your data. Its hybrid data architecture, which combines the capabilities of an in-memory database with the scalability of distributed systems, enables SingleStore to offer lightning-fast data ingestion rates and query execution times.

Beyond its HTAP capabilities, SingleStore serves as the ideal contextual database to power modern generative AI workloads that require RAG, vector functions, semantic search and keyword search. As a contextual database, SingleStore has the ability to run hybrid searches across several data types (text, JSON, BLOB, time-series, geospatial and relational SQL data), co-locating your vector data with transactional and analytical data in plain SQL!

Vector data type and indexing

SingleStore's approximate-nearest-neighbor (ANN) search leverages indexed searches to efficiently find the top-K closest matches to a query vector, which is crucial for large-scale data sets and applications like semantic text search. On the other hand, a k-nearest neighbor (KNN) search provides exact matches but is generally more resource intensive.
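To make the KNN side of this distinction concrete, here is a minimal sketch of an exact top-K search in plain Python, using made-up toy vectors. This full scan is what a FLAT/KNN search does; an ANN index such as IVF or HNSW aims for near-identical results on large datasets without scoring every row.

```python
# Exact top-K search: score every stored vector against the query,
# then keep the K best. Toy data for illustration only.

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def knn_top_k(vectors, query, k):
    # Full scan: compute similarity for every candidate, sort, truncate.
    scored = [(name, dot_product(vec, query)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

embeddings = {
    "resume_a": [0.9, 0.1, 0.0],
    "resume_b": [0.2, 0.8, 0.1],
    "resume_c": [0.7, 0.3, 0.2],
}
query = [1.0, 0.0, 0.0]

print(knn_top_k(embeddings, query, k=2))
# resume_a (similarity 0.9) ranks above resume_c (0.7)
```

The cost of this exactness is the full scan itself, which is why ANN indexes become attractive once the table holds millions of embeddings.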

Using vector indexes in SingleStore

Vector indexes in SingleStore enhance query performance for high-dimensional data, commonly used in machine learning, search applications and recommendation systems. Understanding when and why to use each type of vector index is crucial for optimizing your database's performance and storage efficiency.

Types of vector indexes include:

FLAT

  • When to use. Small datasets, or when exact nearest neighbors are required.
  • Why? Performs a full scan ensuring the most accurate results, but can be slower on large datasets.

HNSW (Hierarchical Navigable Small World)

  • When to use. General purpose, suitable for a broad range of datasets.
  • Why? Balances speed and accuracy well, efficient for both small and large datasets; offers faster queries by constructing a multi-layered graph structure.

IVF (Inverted File Index)

  • When to use. Large datasets where query speed is critical, and approximate results are acceptable.
  • Why? Divides the dataset into clusters to reduce the search area, significantly increasing query speed at the cost of slight accuracy loss.

Selecting the right vector index depends on your specific application needs, the size of your dataset and the balance you wish to strike between query speed and accuracy. Experimenting with different index types and configurations is key to finding the optimal setup for your use case:

  • Dataset size. For smaller datasets or when absolute accuracy is paramount, FLAT is suitable. For larger datasets or when query performance is a priority, consider HNSW or IVF.
  • Query speed vs. accuracy. If your application requires the fastest possible queries and can tolerate approximate results, IVF is recommended. For a good balance between speed and accuracy, HNSW is often the best choice.
  • Application requirements. Recommendation systems and content similarity searches often benefit from HNSW due to its efficiency and balance. Exact match searches or smaller datasets may lean toward FLAT.
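The selection rules above can be condensed into a small helper function. This is purely illustrative: the row-count threshold for "small" is an arbitrary assumption, not a SingleStore recommendation.

```python
def pick_vector_index(n_rows, need_exact=False, speed_over_accuracy=False):
    """Illustrative index chooser following the guidance above.
    The 100k-row cutoff for 'small' is a placeholder assumption."""
    SMALL = 100_000
    if need_exact or n_rows < SMALL:
        return "FLAT"   # full scan, exact results
    if speed_over_accuracy:
        return "IVF"    # clustered search, fastest approximate queries
    return "HNSW"       # balanced speed/accuracy default
```

For example, `pick_vector_index(50_000)` returns `"FLAT"`, while `pick_vector_index(10_000_000, speed_over_accuracy=True)` returns `"IVF"`.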

The architecture

Establishing the database

The process begins by setting up a database environment for storing and processing resume data. Using SQL within the Jupyter environment, a database named resume_evaluator is created, along with a table resumes_profile_data to store essential candidate information and resume embeddings. The resumes_profile_data table contains the embeddings column, stored as a native VECTOR data type in SingleStore. This column will be indexed at a later step in this demo to showcase ANN search.

Environment configuration

Necessary Python packages like pdfminer.six for PDF processing and openai for accessing the OpenAI API are installed. The OpenAI API key is securely inputted to authenticate and initiate the OpenAI client.

Resume processing: Reading and cleaning the resume PDF

The first step in processing a candidate's resume is to read and clean the PDF. This involves extracting text and removing any non-essential characters or formatting issues that could impede analysis.

Extracting key information

Using a custom function, pinfo_extractor, the process extracts crucial details like the candidate's name, email, phone number, years of experience and technical skills set from the cleaned resume text.

Generating resume embeddings

The next step involves generating embeddings for the entire resume text. Embeddings are vector representations of text, enabling the AI to understand and compare different documents semantically.

Job Description (JD) processing: Receiving and processing the JD

The job description is received as input, and embeddings are generated using the same model. This ensures that the JD and resumes are in a comparable format.

Matching JD with resume embeddings

A dot_product operation is performed against the database to match JD embeddings with resume embeddings, facilitating an intelligent and precise matching process. The IVF_FLAT vector index is utilized for this ANN search.

Evaluating the matches: Iterating over the result set

For each resume in the top matching results, the resume and JD are passed to the Language Model (LM). This step is crucial for assessing the suitability of a candidate for the specific role.

Prompt engineering for suitability assessment

Prompts are engineered to ask the LM whether the resume is a good match for the JD. This involves a nuanced understanding of both the resume and the JD, utilizing the power of OpenAI's language models.

Returning a response

Based on the LM's analysis, a response is generated indicating whether the resume is suitable for the JD or not. This step marks the culmination of the automated evaluation process, providing recruiters with valuable insights into candidate suitability.

The code

1. Create a workspace in your workspace group inside SingleStore (S-00 is sufficient).

2. Create a database named resume_evaluator and a table resumes_profile_data in SingleStore.

DROP DATABASE IF EXISTS resume_evaluator;
CREATE DATABASE resume_evaluator;

CREATE TABLE IF NOT EXISTS resume_evaluator.resumes_profile_data (
    names TEXT,
    email TEXT,
    phone_no TEXT,
    years_of_experience TEXT,
    skills TEXT,
    profile_name TEXT,
    resume_summary TEXT,
    resume_embeddings VECTOR(1536)
);

3. Install and import required libraries. In this section, we will set up the necessary environment by installing some important libraries. To extract text from resume PDFs, we'll be using pdfminer.six. To interact with OpenAI's LLMs and manage our data efficiently, openai will be instrumental. The install process may take a couple of minutes.

!pip install -q pdfminer.six openai
import os
import requests
import re
import getpass
from pdfminer.high_level import extract_text
import openai
from openai import OpenAI
import pandas as pd
import numpy as np
from sqlalchemy import text, create_engine
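The helper functions below rely on a `connection_url` variable. Inside SingleStore's notebook environment it is provided for you; elsewhere you can assemble it yourself from your workspace credentials. Every value below is a placeholder, not a real endpoint.

```python
# Placeholder credentials: replace each value with your own workspace details.
user = "admin"
password = "YOUR_PASSWORD"
host = "svc-xxxx.svc.singlestore.com"  # placeholder workspace endpoint
port = 3306
database_name = "resume_evaluator"

# SQLAlchemy connection URL in the mysql+pymysql dialect SingleStore accepts
connection_url = f"mysql+pymysql://{user}:{password}@{host}:{port}/{database_name}"
```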

4. Create a function called get_embedding(). In our workflow, we need a consistent way to transform textual content into vector embeddings. To achieve this, we introduce the get_embedding() function.

This function takes in a piece of text and by default, uses the "text-embedding-ada-002" model to produce embeddings. We ensure that any newline characters in the text are replaced with spaces to maintain the integrity of the input. The function then leverages OpenAI's API to generate and retrieve the embedding for the given text.

api_key = getpass.getpass('OpenAI API Key: ')
openai.api_key = api_key
client = OpenAI(api_key=api_key)

def get_embedding(text, model="text-embedding-ada-002"):
    text = text.replace("\n", " ")
    response = openai.embeddings.create(input=[text], model=model)
    return response.data[0].embedding

5. Create a function called print_pdf_text(). This function is designed to extract and clean the text from a provided PDF, either from a web URL or a local file path.

Input:

  • url: Web URL of the PDF (optional)
  • file_path: Local path of the PDF (optional)

Functionality:

  • Source determination
    • Fetches PDF from url or uses the local file_path
  • Text extraction. Extracts text from the PDF using pdfminer
  • Text cleaning
    • Removes special characters, retaining only "@", "+", ".", "/", ":" and ","
    • Improves formatting by handling newline characters.
  • Cleanup. If a temporary file was created from a URL, it gets deleted post-processing.
  • Output. Returns the cleaned and formatted text from the PDF.

def print_pdf_text(url=None, file_path=None):
    # Determine the source of the PDF (URL or local file)
    if url:
        response = requests.get(url)
        response.raise_for_status()  # Ensure the request was successful
        temp_file_path = "temp_pdf_file.pdf"
        with open(temp_file_path, 'wb') as temp_file:
            temp_file.write(response.content)  # Save the PDF to a temporary file
        pdf_source = temp_file_path
    elif file_path:
        pdf_source = file_path  # Use the provided local file path
    else:
        raise ValueError("Either url or file_path must be provided.")

    # Extract text using pdfminer
    text = extract_text(pdf_source)

    # Remove special characters except "@", "+", ".", "/", ":" and ","
    cleaned_text = re.sub(r"[^a-zA-Z0-9\s@+./:,]", "", text)

    # Format the text for better readability
    cleaned_text = cleaned_text.replace("\n\n", " ").replace("\n", " ")

    # If a temporary file was used, delete it
    if url and os.path.exists(temp_file_path):
        os.remove(temp_file_path)
    return cleaned_text
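To see what the cleaning step actually does, here is the same regex and newline handling applied to a made-up snippet of extracted resume text:

```python
import re

# Made-up sample of raw pdfminer output
raw = "Jane Roe\n\njane.roe@example.com\n+1 555/123 4567 (mobile) Python, SQL!"

cleaned = re.sub(r"[^a-zA-Z0-9\s@+./:,]", "", raw)         # drop "(", ")", "!" etc.
cleaned = cleaned.replace("\n\n", " ").replace("\n", " ")  # flatten newlines

print(cleaned)
# Jane Roe jane.roe@example.com +1 555/123 4567 mobile Python, SQL
```

Note that the characters the matching step depends on, such as "@" in emails and "+" and "/" in phone numbers, survive the cleaning.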

6. Create a function called pinfo_extractor(). This function is tailored to extract specific details from a candidate's resume text.

Input:

  • resume_text: The text extracted from a candidate's resume

Functionality:

  • Prompt creation
    • A context is formed using the provided resume text
    • A detailed question prompt is generated to guide the extraction of desired details from the resume
  • OpenAI API interaction

    • Uses the gpt-3.5-turbo model to process the prompt and generate a detailed extraction
    • Extracts relevant sections like Name, Email, Phone Number, and more from the generated response
  • Data structuring
    • The extracted details are organized into a dictionary
  • Output. Returns a dictionary with keys like 'name', 'email', 'phone_no' and more, containing extracted information from the resume.

def pinfo_extractor(resume_text):
    context = f"Resume text: {resume_text}"
    question = """From the above candidate's resume text, extract only the following details:
Name: (Find the candidate's full name. If not available, specify "not available.")
Email: (Locate the candidate's email address. If not available, specify "not available.")
Phone Number: (Identify the candidate's phone number. If not found, specify "not available.")
Years of Experience: (If not explicitly mentioned, calculate the years of experience by analyzing the time durations at each company or position listed. Sum up the total durations to estimate the years of experience. If not determinable, write "not available.")
Skills Set: Extract the skills which are purely technical and represent them as: [skill1, skill2, ... <other skills from resume>]. If no skills are provided, state "not available."
Profile: (Identify the candidate's job profile or designation. If not mentioned, specify "not available.")
Summary: Provide a brief summary of the candidate's profile without using more than one newline to segregate sections.
"""
    prompt = f"""
Based on the below given candidate information, only answer the asked question:
{context}
Question: {question}
"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful HR recruiter."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=700,
        temperature=0.5,
        n=1  # one generation per document
    )
    # Extract the generated response
    response_text = response.choices[0].message.content
    print(response_text)

    # Split the response into lines, then split each line on the colon
    # to separate the labels from the values
    lines = response_text.strip().split('\n')
    name = lines[0].split(': ')[1]
    email = lines[1].split(': ')[1]
    phone_no = lines[2].split(': ')[1]
    years_of_expiernce = lines[3].split(': ')[1]
    skills = lines[4].split(': ')[1]
    profile = lines[5].split(': ')[1]
    summary = lines[6].split(': ')[1]

    data_dict = {
        'name': name,
        'email': email,
        'phone_no': phone_no,
        'years_of_expiernce': years_of_expiernce,
        'skills': skills,
        'profile': profile,
        'summary': summary
    }
    print(data_dict, "\n")
    return data_dict
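The positional split above assumes the model always returns the seven labeled lines in the requested order. A slightly more defensive parser (a hypothetical helper, not part of the original demo) matches on the labels themselves and falls back to "not available" for anything missing:

```python
def parse_labeled_response(response_text):
    """Map 'Label: value' lines to the demo's dict keys, tolerating
    missing or reordered lines. (Key spelling matches the demo's dict.)"""
    labels = {
        "name": "name",
        "email": "email",
        "phone number": "phone_no",
        "years of experience": "years_of_expiernce",
        "skills set": "skills",
        "profile": "profile",
        "summary": "summary",
    }
    data = {key: "not available" for key in labels.values()}
    for line in response_text.strip().split("\n"):
        label, sep, value = line.partition(":")
        key = labels.get(label.strip().lower())
        if sep and key and value.strip():
            data[key] = value.strip()
    return data

# Made-up, partial LLM-style response with lines missing
sample = "Name: Jane Roe\nEmail: jane@example.com\nSkills Set: [Python, SQL]"
print(parse_labeled_response(sample))
```

This trades a few lines of code for robustness against the model omitting a field or inserting blank lines between sections.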

7. Create a function called add_data_to_db(). This function is designed to add extracted resume details into a database.

Input:

  • input_dict: Dictionary containing details like 'name', 'email', 'phone_no' and more extracted from a resume

Functionality:

  • Database connection
    • Establishes a connection to the database using SQLAlchemy's create_engine with the given connection URL
  • Embedding creation
    • Calls the get_embedding() function to generate an embedding for the resume summary

  • SQL query formation
    • Crafts an SQL query to insert the provided data (from the input dictionary) into the resumes_profile_data table in the database
  • Data insertion
    • Opens a connection, executes the SQL query, commits the changes and closes the connection.
  • Output. Prints a confirmation message upon successful data insertion.

def add_data_to_db(input_dict):
    # Create the SQLAlchemy engine using the workspace connection URL,
    # e.g. 'mysql+pymysql://{user}:{password}@{host}:{port}/{database_name}'
    engine = create_engine(connection_url)

    # Get the embedding for the summary text
    summary = input_dict['summary']
    embedding = get_embedding(summary)

    # Create the SQL query for inserting the data
    query_sql = f"""
    INSERT INTO resumes_profile_data (names, email, phone_no,
        years_of_experience, skills, profile_name, resume_summary, resume_embeddings)
    VALUES ("{input_dict['name']}", "{input_dict['email']}",
        "{input_dict['phone_no']}", "{input_dict['years_of_expiernce']}",
        "{input_dict['skills']}", "{input_dict['profile']}",
        "{summary}", '{embedding}');
    """
    with engine.connect() as connection:
        connection.execute(text(query_sql))
        connection.commit()
    print("\nData written to the resumes_profile_data table")
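Interpolating values straight into the INSERT string breaks on resumes containing quotation marks and is unsafe in general. One defensive alternative is bound parameters. The sketch below (a hypothetical helper, not part of the original demo) only builds the statement and parameter dict; you would pass them to SQLAlchemy's text() and connection.execute() as in the function above.

```python
import json

def build_resume_insert(input_dict, embedding):
    """Build a parameterized INSERT for resumes_profile_data.
    The embedding list is serialized to a JSON array literal, the same
    textual form the demo's f-string insert produces for the VECTOR column."""
    sql = (
        "INSERT INTO resumes_profile_data "
        "(names, email, phone_no, years_of_experience, skills, "
        "profile_name, resume_summary, resume_embeddings) "
        "VALUES (:name, :email, :phone_no, :years, :skills, "
        ":profile, :summary, :embedding)"
    )
    params = {
        "name": input_dict["name"],
        "email": input_dict["email"],
        "phone_no": input_dict["phone_no"],
        "years": input_dict["years_of_expiernce"],  # key spelling as in the demo
        "skills": input_dict["skills"],
        "profile": input_dict["profile"],
        "summary": input_dict["summary"],
        "embedding": json.dumps(embedding),
    }
    return sql, params
```

With SQLAlchemy you would then run `connection.execute(text(sql), params)`, letting the driver handle quoting.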

8. Create a function called search_resumes(). This function facilitates the search for resumes that are most similar to a given query, leveraging embeddings and database operations.

Input:

  • query: A string that represents the job description or any other search criteria.

Functionality:

  • Embedding creation
    • Converts the given query into its corresponding embedding using the get_embedding() function
  • SQL query formation
    • Creates an SQL query to search for the top five resumes in the resumes_profile_data table that have the highest similarity (dot product) to the query embedding.
  • Database operations
    • Opens a connection to the database, runs the SQL query to fetch the results and closes the connection.
  • Output. Returns a list of the top five most relevant resumes based on the given query.

def search_resumes(query):
    query_embed = get_embedding(query)
    query_sql = f"""
    SELECT names, resume_summary, dot_product(
        '{query_embed}' :> VECTOR(1536),
        resume_embeddings
    ) AS similarity
    FROM resumes_profile_data
    ORDER BY similarity DESC
    LIMIT 5;
    """
    # Connect using the workspace connection URL,
    # e.g. 'mysql+pymysql://{user}:{password}@{host}:{port}/{database_name}'
    engine = create_engine(connection_url)
    connection = engine.connect()
    result = connection.execute(text(query_sql)).fetchall()
    connection.close()
    engine.dispose()
    return result

9. Create a function called evaluate_candidates(). This function is geared toward evaluating the compatibility of candidates' resumes in relation to a specific job description.

Input:

  • query: A string that represents the job description against which candidates' resumes will be assessed

Functionality:

  • Resume retrieval
    • Utilizes the search_resumes() function to get the top matching resumes based on the job description
  • OpenAI API interaction
    • For each retrieved resume, a prompt is crafted, asking to evaluate how well the candidate fits the job description
  • Data aggregation
    • Collects the model's evaluation responses for each candidate in a list
  • Output. Returns a list of tuples, where each tuple contains:
    • Candidate's name
    • Evaluation response from the model, describing the compatibility of the candidate with the given job description

def evaluate_candidates(query):
    result = search_resumes(query)
    responses = []  # List to store responses for each candidate
    for resume_str in result:
        name = resume_str[0]
        context = f"Resume text: {resume_str[1]}"
        question = f"What percentage of the job requirements does the candidate meet for the following job description? Answer in 3 lines only and be efficient while answering: {query}."
        prompt = f"""
Read the below information about the candidate:
{context}
Question: {question}
"""
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are an expert HR analyst and recruiter."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=100,
            temperature=0.2,
            n=1  # one generation per document
        )
        # Extract the generated response and store it with the candidate's name
        response_text = response.choices[0].message.content
        responses.append((name, response_text))
    return responses

Here is how you can run it:

urls = [
    "https://github.com/your_user_name/Your_Resume.pdf"
]
for url in urls:
    resume_text = print_pdf_text(url=url).replace('\n', ' ')
    print("Resume text extracted\n")
    ip_data_dict = pinfo_extractor(resume_text)
    print("Information extracted\n")
    add_data_to_db(ip_data_dict)
    print("\n")

Then, in a separate cell, create the IVF_FLAT index on the resume_embeddings column defined in the table earlier:

%%sql
ALTER TABLE resumes_profile_data
ADD VECTOR INDEX ivf_flat (resume_embeddings)
INDEX_OPTIONS '{"index_type":"IVF_FLAT"}';

Conclusion

The integration of SingleStore's database and OpenAI's language models offers a revolutionary approach to resume evaluation in HR. This solution not only streamlines the recruitment process but also enhances the accuracy of candidate selection, enabling HR professionals to focus on strategic decision making rather than manual data processing. By embracing this advanced technological combination, organizations can significantly improve their talent acquisition and management capabilities.

