Real Time Recommendation Engine

Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace, navigate to Start in the left nav. You can also use your existing Standard or Premium workspace with this notebook.

How to build a real-time recommendation engine with SingleStore & Vercel

We will demonstrate how to build a modern real-time AI application for free using a Shared Tier Database, SingleStore Notebooks, and Job Service.

A Free SingleStore Starter Workspace enables you to execute hybrid search, real-time analytics, and point reads/writes/updates in a single database. With SingleStore Notebooks and our Job Service, you can easily bring in data from various sources (APIs, MySQL/MongoDB endpoints) in real time. You can also execute Python-based transforms, such as adding embeddings, ensuring that real-time data is readily available for your downstream LLMs and applications.

We will showcase the seamless transition from a prototype to an end application using SingleStore. The final application is hosted on Vercel; you can see the app we've built by following this notebook here.

Architecture:

[Architecture diagram]

Scenario:

We will build a recommendation engine that suggests which LLM you should use for your use case, bringing together semantic search and real-time analytics on LLM performance to make the recommendations.

Here are the requirements we've set out for this recommendation engine:

  1. Pull data from the Hugging Face Leaderboard on various open-source LLM models and their scores. Pull updated scores for these models every hour.

  2. For each of these models, pull data from Twitter and GitHub on what developers are saying about them and how they are being used in active projects. Pull this data every hour.

  3. Provide an easy 'search' interface where users can describe their use case. When a user describes their use case, perform a hybrid search (vector + full-text search) across the descriptions of these models, what users are saying about them on Twitter, and which GitHub repos are using these LLMs.

  4. Combine the results of the semantic search with analytics on the public benchmarks and the number of likes and downloads of these models.

  5. Power the app entirely on a single SingleStore Free Shared Tier Workspace.

  6. Ensure that all of the latest posts/scores are reflected in the app. Power this entirely with SingleStore Notebooks and the Job Service.
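The hybrid search in requirement 3 can be sketched as a single SingleStore query that blends a vector DOT_PRODUCT score with a full-text MATCH score. This is only an illustration: the table and column names follow the schema created later in this notebook, a FULLTEXT index on clean_text is assumed, and the 0.7/0.3 weighting is an arbitrary choice, not the app's exact query.

```python
# Hedged sketch of a hybrid (vector + full-text) search in SingleStore.
# Assumes a FULLTEXT index on clean_text and a BLOB embedding column;
# the 0.7/0.3 weights are illustrative, not production values.
def build_hybrid_search_query(table_name: str = 'model_readmes', limit: int = 5) -> str:
    return f'''
        SELECT
            model_repo_id,
            DOT_PRODUCT(embedding, JSON_ARRAY_PACK(%s)) AS semantic_score,
            MATCH(clean_text) AGAINST (%s) AS keyword_score,
            0.7 * DOT_PRODUCT(embedding, JSON_ARRAY_PACK(%s))
                + 0.3 * MATCH(clean_text) AGAINST (%s) AS hybrid_score
        FROM {table_name}
        ORDER BY hybrid_score DESC
        LIMIT {limit}
    '''

# At execution time, the user's query embedding (as a JSON array string)
# and their raw search text would be bound to the placeholders.
query = build_hybrid_search_query()
```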

Contents

  • Step 1: Creating a Starter Workspace

  • Step 2: Installing & Importing required libraries

  • Step 3: Setting Key Variables

  • Step 4: Designing your table schemas on SingleStore

  • Step 5: Creating Helper Functions to load data into SingleStore

  • Step 6: Loading data with embeddings into SingleStore

  • Step 7: Building the Recommendation Engine Algorithm on Vercel

Step 1. Create a Starter Workspace

Create a new Workspace Group and select a Starter Workspace. If you do not have this enabled, email pm@singlestore.com.

Step 2. Install and import required libraries

In [1]:

%pip install singlestoredb openai tiktoken beautifulsoup4 pandas python-dotenv Markdown praw tweepy --quiet
import re
import json
import openai
import tiktoken
import requests
import getpass
import pandas as pd
import singlestoredb as s2
import tweepy
import praw
from bs4 import BeautifulSoup
from markdown import markdown
from datetime import datetime
from time import time, sleep

Step 3. Setting environment variables

3.1. Set the app common variables. Do not change these

In [2]:

MODELS_LIMIT = 100
MODELS_TABLE_NAME = 'models'
MODEL_READMES_TABLE_NAME = 'model_readmes'
MODEL_TWITTER_POSTS_TABLE_NAME = 'model_twitter_posts'
MODEL_REDDIT_POSTS_TABLE_NAME = 'model_reddit_posts'
MODEL_GITHUB_REPOS_TABLE_NAME = 'model_github_repos'
LEADERBOARD_DATASET_URL = 'https://llm-recommender.vercel.app/datasets/leaderboard.json'
TOKENS_LIMIT = 2047
TOKENS_TRASHHOLD_LIMIT = TOKENS_LIMIT - 128

3.2. Set the OpenAI variables

We will be using OpenAI's embedding models to create vectors representing our data. The vectors will be stored in the SingleStore Starter Workspace as a column in the relevant tables.

Using OpenAI's LLMs, we will also generate output text after we complete the Retrieval Augmented Generation (RAG) steps.

  1. Open the OpenAI API keys page

  2. Create a new key

  3. Copy the key and paste it into the OPENAI_API_KEY variable

In [3]:

OPENAI_API_KEY = getpass.getpass("enter openAI apikey here")

3.3. Set the HuggingFace variables

We will be pulling data from HuggingFace about the different models, the usage of these models, and how they score in several evaluation metrics.

  1. Open the HuggingFace Access Tokens page

  2. Create a new token

  3. Copy the key and paste it into the HF_TOKEN variable

In [4]:

HF_TOKEN = getpass.getpass("enter HuggingFace apikey here")

3.4. Set the Twitter variables

We will be pulling data from Twitter about what users are saying about these models. Since the quality of these models may change over time, we want to capture the sentiment of what people are talking about and using on Twitter.

  1. Open the Twitter Developer Projects & Apps page

  2. Add a new app

  3. Fill the form

  4. Generate a Bearer Token and paste it into the TWITTER_BEARER_TOKEN variable

In [5]:

TWITTER_BEARER_TOKEN = getpass.getpass("enter Twitter Bearer Token here")

3.5 Set the GitHub variables

We will also be pulling data from various GitHub repos on which models are being referenced and which scenarios they are used for.

  1. Open the Register new GitHub App page

  2. Fill the form

  3. Get an access token and paste it into the GITHUB_ACCESS_TOKEN variable

In [6]:

GITHUB_ACCESS_TOKEN = getpass.getpass("enter Github Access Token here")

Step 4. Designing and creating your table schemas in SingleStore

We will be storing all of this data in a single Free Shared Tier Database. Through this database, you can write hybrid search queries, run analytics on the model's performance, and get real-time reads/updates.

  • connection - database connection to execute queries

  • create_tables - function that creates empty tables in the database

  • drop_table - helper function to drop a table

  • get_models - helper function to get models from the models table

  • db_get_last_created_at - helper function to get last created_at value from a table

The create_tables function creates the following tables:

  • models_table - table with all models data from the Open LLM Leaderboard

  • readmes_table - table with model readme texts from the HuggingFace model pages (used in semantic search)

  • twitter_posts - table with tweets related to models (used in semantic search)

  • github_repos - table with GitHub readme texts related to models (used in semantic search)

Action Required

Make sure to select a database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

In [7]:

connection = s2.connect(connection_url)


def create_tables():
    def create_models_table():
        with connection.cursor() as cursor:
            cursor.execute(f'''
                CREATE TABLE IF NOT EXISTS {MODELS_TABLE_NAME} (
                    id INT AUTO_INCREMENT PRIMARY KEY,
                    name VARCHAR(512) NOT NULL,
                    author VARCHAR(512) NOT NULL,
                    repo_id VARCHAR(1024) NOT NULL,
                    score DECIMAL(5, 2) NOT NULL,
                    arc DECIMAL(5, 2) NOT NULL,
                    hellaswag DECIMAL(5, 2) NOT NULL,
                    mmlu DECIMAL(5, 2) NOT NULL,
                    truthfulqa DECIMAL(5, 2) NOT NULL,
                    winogrande DECIMAL(5, 2) NOT NULL,
                    gsm8k DECIMAL(5, 2) NOT NULL,
                    link VARCHAR(255) NOT NULL,
                    downloads INT,
                    likes INT,
                    still_on_hub BOOLEAN NOT NULL,
                    created_at TIMESTAMP,
                    embedding BLOB
                )
            ''')

    def create_model_readmes_table():
        with connection.cursor() as cursor:
            cursor.execute(f'''
                CREATE TABLE IF NOT EXISTS {MODEL_READMES_TABLE_NAME} (
                    id INT AUTO_INCREMENT PRIMARY KEY,
                    model_repo_id VARCHAR(512),
                    text LONGTEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
                    clean_text LONGTEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
                    created_at TIMESTAMP,
                    embedding BLOB
                )
            ''')

    def create_model_twitter_posts_table():
        with connection.cursor() as cursor:
            cursor.execute(f'''
                CREATE TABLE IF NOT EXISTS {MODEL_TWITTER_POSTS_TABLE_NAME} (
                    id INT AUTO_INCREMENT PRIMARY KEY,
                    model_repo_id VARCHAR(512),
                    post_id VARCHAR(256),
                    clean_text LONGTEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
                    created_at TIMESTAMP,
                    embedding BLOB
                )
            ''')

    def create_model_github_repos_table():
        with connection.cursor() as cursor:
            cursor.execute(f'''
                CREATE TABLE IF NOT EXISTS {MODEL_GITHUB_REPOS_TABLE_NAME} (
                    id INT AUTO_INCREMENT PRIMARY KEY,
                    model_repo_id VARCHAR(512),
                    repo_id INT,
                    name VARCHAR(512) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
                    description TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
                    clean_text LONGTEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
                    link VARCHAR(256),
                    created_at TIMESTAMP,
                    embedding BLOB
                )
            ''')

    create_models_table()
    create_model_readmes_table()
    create_model_twitter_posts_table()
    create_model_github_repos_table()


def drop_table(table_name: str):
    with connection.cursor() as cursor:
        cursor.execute(f'DROP TABLE IF EXISTS {table_name}')


def get_models(select='*', query='', as_dict=True):
    with connection.cursor() as cursor:
        _query = f'SELECT {select} FROM {MODELS_TABLE_NAME}'

        if query:
            _query += f' {query}'

        cursor.execute(_query)

        if as_dict:
            columns = [desc[0] for desc in cursor.description]
            return [dict(zip(columns, row)) for row in cursor.fetchall()]

        return cursor.fetchall()


def db_get_last_created_at(table, repo_id, to_string=False):
    with connection.cursor() as cursor:
        cursor.execute(f"""
            SELECT UNIX_TIMESTAMP(created_at) FROM {table}
            WHERE model_repo_id = '{repo_id}'
            ORDER BY created_at DESC
            LIMIT 1
        """)
        rows = cursor.fetchone()
        created_at = float(rows[0]) if rows and rows[0] else None

    if created_at and to_string:
        created_at = datetime.fromtimestamp(created_at)
        created_at = created_at.strftime('%Y-%m-%dT%H:%M:%SZ')

    return created_at
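The to_string branch of db_get_last_created_at converts the stored UNIX timestamp into a 'YYYY-MM-DDTHH:MM:SSZ' string, the shape commonly required by start_time-style parameters in REST APIs (how the app consumes it is an assumption, since the downstream calls come later). In isolation, the conversion is just:

```python
from datetime import datetime

def unix_to_iso8601(ts: float) -> str:
    # Mirrors db_get_last_created_at(..., to_string=True): a UNIX
    # timestamp rendered as 'YYYY-MM-DDTHH:MM:SSZ'. Note that
    # fromtimestamp() uses local time, same as the original helper.
    return datetime.fromtimestamp(ts).strftime('%Y-%m-%dT%H:%M:%SZ')

stamp = unix_to_iso8601(1_700_000_000.0)
```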

Step 5. Creating helper functions to load data into SingleStore

5.1. Setting up the openai.api_key

In [8]:

openai.api_key = OPENAI_API_KEY

5.2. Create the create_embedding function

This function creates an embedding for the input passed to it. We will do this for all data pulled from GitHub, HuggingFace, and Twitter. The resulting vector embeddings are stored in the same SingleStore table as a separate column.

In [9]:

def count_tokens(text: str):
    enc = tiktoken.get_encoding('cl100k_base')
    return len(enc.encode(text, disallowed_special={}))


def create_embedding(input):
    try:
        data = openai.embeddings.create(input=input, model='text-embedding-ada-002').data
        return data[0].embedding
    except Exception as e:
        print(e)
        return [[]]
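create_embedding swallows failures and returns an empty placeholder, which keeps a long ingest job alive but silently drops vectors. A minimal alternative, sketched here with a generic callable (with_retries is our own name, not an OpenAI or SingleStore API), is to retry transient errors with exponential backoff before giving up:

```python
import time
from typing import Callable, List

def with_retries(fn: Callable[[], List[float]], attempts: int = 3,
                 backoff: float = 1.0) -> List[float]:
    # Call fn(), retrying on any exception with exponential backoff;
    # the final failure is re-raised instead of being swallowed.
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))

# Usage sketch: with_retries(lambda: create_embedding(text))
```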

5.3. Create utility functions to parse the data ingested from the various sources

This is a set of functions that ensure the JSON is in the right format so it can be stored in SingleStore as a JSON column. In your Free Shared Tier workspace, you can bring in data of various formats (JSON, geospatial, vector) and interact with it using SQL and the MongoDB API.

In [10]:

class JSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.strftime('%Y-%m-%d %H:%M:%S')
        return super().default(obj)


def list_into_chunks(lst, chunk_size=100):
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]


def string_into_chunks(string: str, max_tokens=TOKENS_LIMIT):
    if count_tokens(string) <= max_tokens:
        return [string]

    delimiter = ' '
    words = string.split(delimiter)
    chunks = []
    current_chunk = []

    for word in words:
        if count_tokens(delimiter.join(current_chunk + [word])) <= max_tokens:
            current_chunk.append(word)
        else:
            chunks.append(delimiter.join(current_chunk))
            current_chunk = [word]

    if current_chunk:
        chunks.append(delimiter.join(current_chunk))

    return chunks


def clean_string(string: str):
    def strip_html_elements(string: str):
        html = markdown(string)
        soup = BeautifulSoup(html, "html.parser")
        text = soup.get_text()
        return text.strip()

    def remove_unicode_escapes(string: str):
        return re.sub(r'[^\x00-\x7F]+', '', string)

    def remove_string_spaces(string: str):
        new_string = re.sub(r'\n+', '\n', string)
        new_string = re.sub(r'\s+', ' ', new_string)
        return new_string

    def remove_links(string: str):
        url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
        return re.sub(url_pattern, '', string)

    new_string = strip_html_elements(string)
    new_string = remove_unicode_escapes(new_string)
    new_string = remove_string_spaces(new_string)
    new_string = re.sub(r'\*\*+', '*', new_string)
    new_string = re.sub(r'--+', '-', new_string)
    new_string = re.sub(r'====+', '=', new_string)
    new_string = remove_links(new_string)
    return new_string
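To make the batching helper concrete, here is list_into_chunks applied to a small list (the function is copied verbatim from the cell above so this snippet stands alone):

```python
def list_into_chunks(lst, chunk_size=100):
    # Split a list into consecutive batches of at most chunk_size items;
    # the final batch holds whatever remainder is left over.
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

batches = list_into_chunks(list(range(7)), chunk_size=3)
# batches -> [[0, 1, 2], [3, 4, 5], [6]]
```

Batching like this keeps multi-row INSERT statements at a bounded size when loading many models at once.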

Step 6. Loading Data into SingleStore

6.1. Load data on all open-source LLM models from the HuggingFace Leaderboard

This function loads a pre-generated Open LLM Leaderboard dataset. Based on this dataset, all model data is created and inserted into the database. We will also create embeddings for all of the pulled data using the OpenAI embedding model.

In [11]:

def leaderboard_get_df():
    response = requests.get(LEADERBOARD_DATASET_URL)

    if response.status_code == 200:
        data = json.loads(response.text)
        df = pd.DataFrame(data).head(MODELS_LIMIT)
        return df
    else:
        print("Failed to retrieve JSON file")


def leaderboard_insert_model(model):
    try:
        _model = {key: value for key, value in model.items() if key != 'readme'}
        to_embedding = json.dumps(_model, cls=JSONEncoder)
        embedding = str(create_embedding(to_embedding))
        model_to_insert = {**_model, 'embedding': embedding}
        readmes_to_insert = []

        if model['readme']:
            readme = {
                'model_repo_id': model['repo_id'],
                'text': model['readme'],
                'created_at': time()
            }

            if count_tokens(readme['text']) <= TOKENS_TRASHHOLD_LIMIT:
                readme['clean_text'] = clean_string(readme['text'])
                to_embedding = json.dumps({
                    'model_repo_id': readme['model_repo_id'],
                    'clean_text': readme['clean_text'],
                })
                readme['embedding'] = str(create_embedding(to_embedding))
                readmes_to_insert.append(readme)
            else:
                for i, chunk in enumerate(string_into_chunks(readme[