SingleStore Notebooks

Movie Recommendation

Note

This tutorial is meant for Standard and Premium workspaces. You can't run it with a Free Starter Workspace due to restrictions on storage. Create a workspace using +group in the left nav and select Standard for this notebook. Gallery notebooks tagged with "Starter" are suitable to run on a Free Starter Workspace.

Source: Full MovieLens 25M Dataset - Application

This notebook demonstrates how SingleStoreDB helps you build a simple Movie Recommender System.

1. Install required libraries

Install the library for vectorizing the data (up to 2 minutes).

In [1]:

!pip install sentence-transformers --quiet

2. Create database and ingest data

Create the movie_recommender database.

In [2]:

%%sql
DROP DATABASE IF EXISTS movie_recommender;
CREATE DATABASE IF NOT EXISTS movie_recommender;

Action Required

Make sure to select the movie_recommender database from the drop-down menu at the top of this notebook. It updates the connection_url which is used by the %%sql magic command and SQLAlchemy to make connections to the selected database.

Create tags table and start pipeline.

In [3]:

%%sql
CREATE TABLE IF NOT EXISTS tags (
    `userId` bigint(20) NULL,
    `movieId` bigint(20) NULL,
    `tag` text CHARACTER SET utf8 COLLATE utf8_general_ci NULL,
    `timestamp` bigint(20) NULL
);
CREATE PIPELINE tags
AS LOAD DATA S3 'studiotutorials/movielens/tags.csv'
CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
BATCH_INTERVAL 2500
MAX_PARTITIONS_PER_BATCH 1
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE `tags`
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\r\n'
NULL DEFINED BY ''
IGNORE 1 LINES
(userId, movieId, tag, timestamp);
START PIPELINE tags;

Create ratings table and start pipeline.

In [4]:

%%sql
CREATE TABLE IF NOT EXISTS ratings (
    userId bigint(20) DEFAULT NULL,
    movieId bigint(20) DEFAULT NULL,
    rating double DEFAULT NULL,
    timestamp bigint(20) DEFAULT NULL
);
CREATE PIPELINE ratings
AS LOAD DATA S3 'studiotutorials/movielens/ratings.csv'
CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
BATCH_INTERVAL 2500
MAX_PARTITIONS_PER_BATCH 1
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE `ratings`
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\r\n'
NULL DEFINED BY ''
IGNORE 1 LINES
(userId, movieId, rating, timestamp);
START PIPELINE ratings;

Create movies table and start pipeline.

In [5]:

%%sql
CREATE TABLE movies (
    movieId bigint(20) DEFAULT NULL,
    title text CHARACTER SET utf8 COLLATE utf8_general_ci,
    genres text CHARACTER SET utf8 COLLATE utf8_general_ci,
    FULLTEXT(title)
);
CREATE PIPELINE movies
AS LOAD DATA S3 'studiotutorials/movielens/movies.csv'
CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
BATCH_INTERVAL 2500
MAX_PARTITIONS_PER_BATCH 1
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE `movies`
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\r\n'
NULL DEFINED BY ''
IGNORE 1 LINES
(movieId, title, genres);
START PIPELINE movies;

Check that all the data has been loaded

There should be 25M rows for ratings, 62K for movies, and 1M for tags. If the counts are lower than that, try the query again in a few seconds; the pipelines are still running.

In [6]:

%%sql
SELECT COUNT(*) AS count_rows FROM ratings
UNION ALL
SELECT COUNT(*) AS count_rows FROM movies
UNION ALL
SELECT COUNT(*) AS count_rows FROM tags
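Instead of re-running the cell by hand, you can poll the counts programmatically until the pipelines finish. A minimal sketch (the get_count callable is a hypothetical wrapper around a SELECT COUNT(*) query on your connection; the poll interval and limit are arbitrary):

```python
import time

def wait_for_rows(get_count, expected, poll_seconds=5, max_polls=120):
    """Poll a row-count callable until it reaches the expected total."""
    for _ in range(max_polls):
        n = get_count()
        if n >= expected:
            return n
        time.sleep(poll_seconds)
    raise TimeoutError(f"still below {expected} rows after {max_polls} polls")

# Stand-in counter for illustration; in the notebook, get_count would run
# SELECT COUNT(*) FROM ratings through the active connection.
counts = iter([10_000_000, 25_000_095])
print(wait_for_rows(lambda: next(counts), 25_000_000, poll_seconds=0))
```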

Concatenate tags and movies tables using all tags

In [7]:

%%sql
CREATE TABLE movies_with_tags AS
SELECT
    m.movieId,
    m.title,
    m.genres,
    GROUP_CONCAT(t.tag SEPARATOR ',') AS allTags
FROM movies m
LEFT JOIN tags t ON m.movieId = t.movieId
GROUP BY m.movieId, m.title, m.genres;

3. Vectorize data

Initialize sentence transformer.

In [8]:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base')

Query the movies_with_tags table and store the output in a variable named result. The result << syntax in the %%sql line indicates that the output from the query should get stored under that variable name.

In [9]:

%%sql result <<
SELECT * FROM movies_with_tags

Convert the result from the above SQL into a DataFrame and clean up quotes.

In [10]:

import pandas as pd
df = pd.DataFrame(result)
# Curate the special characters
df['title'] = df['title'].str.replace('"', '')
df['allTags'] = df['allTags'].str.replace('"', '').str.replace("'", '')
data = df.to_dict(orient='records')
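To see what the quote cleanup and to_dict(orient='records') step produces, here is the same transformation on a tiny hand-made frame (the movie values are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'movieId': [1],
    'title': ['"Matrix, The" (1999)'],
    'allTags': ['"sci-fi",\'classic\''],
})
# Same cleanup as above: strip double quotes from titles, both quote kinds from tags
toy['title'] = toy['title'].str.replace('"', '')
toy['allTags'] = toy['allTags'].str.replace('"', '').str.replace("'", '')
records = toy.to_dict(orient='records')
print(records[0])
# {'movieId': 1, 'title': 'Matrix, The (1999)', 'allTags': 'sci-fi,classic'}
```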

Check the first row of the list.

In [11]:

data[0]

Concatenate title and tags.

In [12]:

all_title_type_column = [f'{row["title"]}-{row["allTags"]}' if row["allTags"] is not None else row["title"] for row in data]
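Rows with no tags come back from the LEFT JOIN with allTags set to None, so a safe concatenation falls back to the bare title for those movies. A toy sketch of that fallback with invented rows:

```python
rows = [
    {'title': 'Toy Story (1995)', 'allTags': 'pixar,animation'},
    {'title': 'Obscure Short (2001)', 'allTags': None},  # no tags joined
]
combined = [
    f"{r['title']}-{r['allTags']}" if r['allTags'] is not None else r['title']
    for r in rows
]
print(combined)
# ['Toy Story (1995)-pixar,animation', 'Obscure Short (2001)']
```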

Create the embeddings for Title & Tag (~3 minutes).

In [13]:

# Remove [:3000] if you want to vectorize all rows (~60 minutes)
all_embeddings = model.encode(all_title_type_column[:3000])
all_embeddings.shape

Merge the original data with the vector data.

In [14]:

# Remember the embeddings list covers only the first 3,000 rows
for row, embedding in zip(data, all_embeddings):
    row['embedding'] = embedding

In [15]:

data[0]

4. Create table for movie information and vectors

In [16]:

%%sql
DROP TABLE IF EXISTS movie_with_tags_with_vectors;
CREATE TABLE movie_with_tags_with_vectors (
    movieId BIGINT(20) DEFAULT NULL,
    title text CHARACTER SET utf8 COLLATE utf8_general_ci,
    genres text CHARACTER SET utf8 COLLATE utf8_general_ci,
    allTags longtext CHARACTER SET utf8mb4,
    vector BLOB
)

Create a database connection using SQLAlchemy. We are going to use an SQLAlchemy connection here because one column of data is numpy arrays. The SingleStoreDB SQLAlchemy driver will automatically convert those to the correct binary format when uploading, so it's a bit more convenient than doing the conversions and formatting manually for the %sql magic command.

In [17]:

from singlestoredb import create_engine
conn = create_engine().connect()
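The conversion the driver handles for you is roughly this: a float32 vector is stored in the BLOB column as its raw bytes (native byte order, typically little-endian). A numpy sketch, assuming 768-dimensional embeddings:

```python
import numpy as np

embedding = np.random.rand(768).astype(np.float32)  # stand-in for a model output
blob = embedding.tobytes()                          # what lands in the BLOB column
restored = np.frombuffer(blob, dtype=np.float32)    # reading it back

print(len(blob))                              # 768 floats * 4 bytes = 3072
print(np.array_equal(embedding, restored))    # True: lossless round trip
```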

Insert the data. Some rows might encounter errors due to unsupported characters.

In [18]:

import sqlalchemy as sa

sql_query = sa.text('''
    INSERT INTO movie_with_tags_with_vectors (
        movieId, title, genres, allTags, vector
    )
    VALUES (
        :movieId, :title, :genres, :allTags, :embedding
    )
''')

conn.execute(sql_query, data[:3000])
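Passing a list of parameter dictionaries to conn.execute runs an executemany-style bulk insert. If a single 3,000-row batch is too large for your workspace, you can split it into smaller batches; a sketch of the chunking (the batch size is arbitrary):

```python
def batches(rows, size=500):
    """Yield successive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# In the notebook this would be:
# for batch in batches(data[:3000]):
#     conn.execute(sql_query, batch)
print([len(b) for b in batches(list(range(1250)), size=500)])
# [500, 500, 250]
```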

5. Marrying Search ❤️ Semantic Search ❤️ Analytics

Build autocomplete search

This is an experiment we started with to render a full-text search.

In [19]:

%%sql
WITH queryouter AS (
    SELECT DISTINCT(title), movieId, MATCH(title) AGAINST ('Pocahontas*') AS relevance
    FROM movies
    WHERE MATCH(title) AGAINST ('Pocahontas*')
    ORDER BY relevance DESC
    LIMIT 10
)
SELECT title, movieId FROM queryouter;

Create user favorite movie tables

In [20]:

%%sql
CREATE ROWSTORE TABLE IF NOT EXISTS user_choice (
    userid text CHARACTER SET utf8 COLLATE utf8_general_ci,
    title text CHARACTER SET utf8 COLLATE utf8_general_ci,
    ts datetime DEFAULT NULL,
    KEY userid (userid)
)

Enter dummy data for testing purposes.

In [21]:

%%sql
INSERT INTO user_choice (userid, title, ts)
VALUES ('user1', 'Zone 39 (1997)', '2022-01-01 00:00:00'),
('user1', 'Star Trek II: The Wrath of Khan (1982)', '2022-01-01 00:00:00'),
('user1', 'Giver, The (2014)', '2022-01-01 00:00:00');

Build semantic search for a movie recommendation

In [22]:

%%sql
WITH table_match AS (
    SELECT
        m.title,
        m.movieId,
        m.vector
    FROM user_choice t
    INNER JOIN movie_with_tags_with_vectors m ON m.title = t.title
    WHERE userid = 'user1'
),
movie_pairs AS (
    SELECT
        m1.movieId AS movieId1,
        m1.title AS title1,
        m2.movieId AS movieId2,
        m2.title AS title2,
        DOT_PRODUCT(m1.vector, m2.vector) AS similarity
    FROM table_match m1
    CROSS JOIN movie_with_tags_with_vectors m2
    WHERE m1.movieId != m2.movieId
      AND NOT EXISTS (
          SELECT 1
          FROM user_choice uc
          WHERE uc.userid = 'user1'
            AND uc.title = m2.title
      )
),
movie_match AS (
    SELECT
        movieId1,
        title1,
        movieId2,
        title2,
        similarity
    FROM movie_pairs
    ORDER BY similarity DESC
),
distinct_count AS (
    SELECT DISTINCT
        movieId2,
        title2 AS Title,
        ROUND(AVG(similarity), 4) AS Rating_Match
    FROM movie_match
    GROUP BY movieId2, title2
    ORDER BY Rating_Match DESC
),
average_ratings AS (
    SELECT
        movieId,
        AVG(rating) AS Avg_Rating
    FROM ratings
    GROUP BY movieId
)
SELECT
    dc.Title,
    dc.Rating_Match AS 'Match Score',
    ROUND(ar.Avg_Rating, 1) AS 'Average User Rating'
FROM distinct_count dc
JOIN average_ratings ar ON dc.movieId2 = ar.movieId
ORDER BY dc.Rating_Match DESC
LIMIT 5;
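DOT_PRODUCT works as the match score above because, for unit-length vectors, the dot product equals cosine similarity (whether this particular sentence-transformer model normalizes its output is worth verifying; if not, you can normalize before inserting). A quick numpy illustration with made-up vectors:

```python
import numpy as np

a = np.array([3.0, 4.0]); a /= np.linalg.norm(a)   # unit-normalize
b = np.array([4.0, 3.0]); b /= np.linalg.norm(b)

dot = float(np.dot(a, b))
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(round(dot, 4), round(cosine, 4))  # identical for unit vectors: 0.96 0.96
```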

6. What are you looking for?

In [23]:

search_embedding = model.encode("I want to see a French comedy movie")

In [24]:

sql_query = sa.text('''
    SELECT title, genres, DOT_PRODUCT(vector, :vector) AS score
    FROM movie_with_tags_with_vectors
    ORDER BY score DESC
    LIMIT 10
''')

results = conn.execute(sql_query, dict(vector=search_embedding))
for i, res in enumerate(results):
    print(f"{i + 1}: {res.title} {res.genres} Score: {res.score}")

Clean up

In [25]:

%%sql
DROP DATABASE IF EXISTS movie_recommender

Details

Tags

#advanced #vectordb #genai #openai

License

This Notebook has been released under the Apache 2.0 open source license.