New

Semantic Search with OpenAI Embedding Creation

Notebook

SingleStore Notebooks

Semantic Search with OpenAI Embedding Creation

Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.

In this notebook, we will demonstrate an example of conducting semantic search on SingleStoreDB with SQL! Unlike traditional keyword-based search methods, semantic search algorithms take into account the relationships between words and their meanings, enabling them to deliver more accurate and relevant results – even when search terms are vague or ambiguous.

SingleStoreDB’s built-in parallelization and Intel SIMD-based vector processing takes care of the heavy lifting involved in processing vector data. This allows your to run your ML algorithms right in your database extremely efficiently with just 2 lines of SQL!

In this example, we use Open AI embeddings API to create embeddings for our dataset and run semantic_search using dot_product vector matching function!

1. Create a workspace in your workspace group

S-00 is sufficient.

Action Required

If you have a Free Starter Workspace deployed already, select the database from drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

2. Create a Database named semantic_search

In [1]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
%sql DROP DATABASE IF EXISTS semantic_search;
%sql CREATE DATABASE semantic_search;

Action Required

Make sure to select the semantic_search database from the drop-down menu at the top of this notebook. It updates the connection_url which is used by the %%sql magic command and SQLAlchemy to make connections to the selected database.

3. Install and import required libraries

We will use the OpenAI embeddings API and will need to import the relevant dependencies accordingly.

In [2]:

!pip3 install openai==1.3.3 requests --quiet
import json
import os
from openai import OpenAI
import requests

4. Create an OpenAI account and get API connection details

To vectorize and embed the employee reviews and query strings, we leverage OpenAI's embeddings API. To use this API, you will need an API key, which you can get here. You'll need to add a payment method to actually get vector embeddings using the API, though the charges are minimal for a small example like we present here.

Action Required

You will have to update your notebook's firewall settings to include *.*.openai.com in order to get embedddings from OpenAI APIS.

In [3]:

import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass('OpenAI API Key: ')
client = OpenAI()

5. Create a new table in your database called reviews

In [4]:

%%sql
DROP TABLE IF EXISTS reviews;
CREATE TABLE reviews (
date_review VARCHAR(255),
job_title VARCHAR(255),
location VARCHAR(255),
review TEXT
);

6. Import our sample data into your table

This dataset has 15 reviews left by anonymous employees of a firm.

In [5]:

url = 'https://raw.githubusercontent.com/singlestore-labs/singlestoredb-samples/main/Tutorials/ai-powered-semantic-search/hr_sample_data.sql'

Note that we are using the %sql magic command here to run a query against the currently selected database.

In [6]:

for query in [x for x in requests.get(url).text.split('\n') if x.strip()]:
%sql {{query}}

7. Add vector embeddings for each review

To embed the reviews in our SingleStoreDB database, we iterate through each row in the table, make a call to OpenAI’s embeddings API with the text in the reviews field and update the new column called embeddings for each entry.

In [7]:

%sql ALTER TABLE reviews ADD embeddings BLOB;
from typing import List
reviews = %sql SELECT review FROM reviews;
reviews = [x.review for x in reviews]
def get_embeddings(inputs: List[str], model: str = 'text-embedding-ada-002') -> List[str]:
"""Return list of embeddings."""
return [x.embedding for x in client.embeddings.create(input=inputs, model=model).data]
embeddings = get_embeddings(reviews)
for embedding, review in zip(embeddings, reviews):
%sql UPDATE reviews SET embeddings = JSON_ARRAY_PACK('{{json.dumps(embedding)}}') WHERE review='{{review}}';

8. Run the semantic search algorithm with just one line of SQL

We will utilize SingleStoreDB's distributed architecture to efficiently compute the dot product of the input string (stored in searchstring) with each entry in the database and return the top 5 reviews with the highest dot product score. Each vector is normalized to length 1, hence the dot product function essentially computes the cosine similarity between two vectors – an appropriate nearness metric. SingleStoreDB makes this extremely fast because it compiles queries to machine code and runs dot_product using SIMD instructions.

In [8]:

searchstring = input('Please enter a search string: ')
search_embedding = json.dumps(get_embeddings([searchstring])[0])
results = %sql SELECT review, DOT_PRODUCT(embeddings, JSON_ARRAY_PACK('{{search_embedding}}')) AS score FROM reviews ORDER BY score DESC LIMIT 5;
print()
for i, res in enumerate(results):
print(f'{i + 1}: {res.review} Score: {res.score}\n')

9. Clean up

Action Required

If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.

In [9]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
%sql DROP DATABASE IF EXISTS semantic_search;

Details

Tags

#starter#openai#vectordb#genai

License

This Notebook has been released under the Apache 2.0 open source license.