Demonstrate ML function Classify
Notebook
Note
You can use your existing Standard or Premium workspace with this Notebook.
This feature is currently in Private Preview. Please reach out to support@singlestore.com to confirm whether this feature can be enabled in your org.
This Jupyter notebook will help you:
Load the titanic dataset
Store the data in a SingleStore table
Use ML Functions for training and predictions
Run some common Data Analysis tasks
Prerequisites: Ensure ML Functions are installed on your deployment (AI > AI & ML Functions).
In [1]:
%%sql
-- Ensure that ML_CLASSIFY is listed in the Functions_in_cluster column
SHOW FUNCTIONS IN CLUSTER;
In [2]:
!pip install -q httplib2 seaborn pandas numpy scikit-learn
Load and Prepare the Titanic Dataset
We'll use the famous Titanic dataset from seaborn, which contains passenger information from the RMS Titanic. The goal is to predict whether a passenger survived based on features like age, sex, ticket class, and fare.
In [3]:
%%sql
CREATE DATABASE IF NOT EXISTS temp;
USE temp;
In [4]:
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load the Titanic dataset
titanic_df = sns.load_dataset('titanic')

# Display basic information
print(f"Dataset shape: {titanic_df.shape}")
print(f"\nColumn names: {list(titanic_df.columns)}")
print("\nFirst 5 rows:")
print(titanic_df.head())

# Check survival distribution
print("\nSurvival Distribution:")
print(titanic_df['survived'].value_counts())
Clean and Prepare Features
We'll select the most important features and handle missing values to create a clean dataset for training.
In [5]:
# Select relevant columns for prediction
columns_to_use = ['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']
titanic_clean = titanic_df[columns_to_use].copy()

# Fill missing values
titanic_clean['age'] = titanic_clean['age'].fillna(titanic_clean['age'].median())
titanic_clean['fare'] = titanic_clean['fare'].fillna(titanic_clean['fare'].median())
titanic_clean['embarked'] = titanic_clean['embarked'].fillna('S')  # Most common port

# Drop any remaining rows with missing values
titanic_clean = titanic_clean.dropna()

# Convert survived to text labels for classification
titanic_clean['survival_status'] = titanic_clean['survived'].map({
    0: 'Died',
    1: 'Survived'
})

# Drop the original numeric survived column
titanic_clean = titanic_clean.drop('survived', axis=1)

print(f"Clean dataset shape: {titanic_clean.shape}")
print("\nMissing values per column:")
print(titanic_clean.isnull().sum())
print("\nSurvival status distribution:")
print(titanic_clean['survival_status'].value_counts())
print("\nFirst 5 rows of clean data:")
print(titanic_clean.head())
Split Data into Training and Test Sets
We'll split the data into 80% training and 20% test sets to evaluate model performance.
In [6]:
# Split into train (80%) and test (20%) sets
train_df, test_df = train_test_split(
    titanic_clean,
    test_size=0.2,
    random_state=42,
    stratify=titanic_clean['survival_status']
)

print(f"Training set size: {len(train_df)} passengers")
print(f"Test set size: {len(test_df)} passengers")
print("\nTraining set survival distribution:")
print(train_df['survival_status'].value_counts())
print("\nTest set survival distribution:")
print(test_df['survival_status'].value_counts())
In [7]:
%%sql
DROP TABLE IF EXISTS titanic_training_data;
DROP TABLE IF EXISTS titanic_test_data;
DROP TABLE IF EXISTS titanic_predictions;

CREATE TABLE titanic_training_data (
    pclass INT,
    sex VARCHAR(10),
    age FLOAT,
    sibsp INT,
    parch INT,
    fare FLOAT,
    embarked VARCHAR(1),
    survival_status VARCHAR(10)
);

CREATE TABLE titanic_test_data (
    pclass INT,
    sex VARCHAR(10),
    age FLOAT,
    sibsp INT,
    parch INT,
    fare FLOAT,
    embarked VARCHAR(1),
    survival_status VARCHAR(10)
);

CREATE TABLE titanic_predictions (
    pclass INT,
    sex VARCHAR(10),
    age FLOAT,
    sibsp INT,
    parch INT,
    fare FLOAT,
    embarked VARCHAR(1),
    actual_status VARCHAR(10),
    predicted_status JSON
);
Load Data into SingleStore Tables
We'll use pandas to insert the training and test data into our SingleStore tables.
In [8]:
import singlestoredb as s2

# Create engine with database specified
engine = s2.create_engine(database='temp')

# Insert training data
train_df.to_sql(
    'titanic_training_data',
    con=engine,
    if_exists='append',
    index=False,
    method='multi'
)

# Insert test data
test_df.to_sql(
    'titanic_test_data',
    con=engine,
    if_exists='append',
    index=False,
    method='multi'
)

print(f"Inserted {len(train_df)} rows into titanic_training_data")
print(f"Inserted {len(test_df)} rows into titanic_test_data")
Verify Data Load
Let's verify that our data was loaded correctly and review the passenger demographics.
In [9]:
%%sql
SELECT COUNT(*) as training_count FROM titanic_training_data;
In [10]:
%%sql
SELECT COUNT(*) as test_count FROM titanic_test_data;
In [11]:
%%sql
SELECT
    survival_status,
    COUNT(*) as passenger_count,
    ROUND(AVG(age), 1) as avg_age,
    ROUND(AVG(fare), 2) as avg_fare
FROM titanic_training_data
GROUP BY survival_status;
SELECT * FROM titanic_training_data LIMIT 5;
Train the ML Classification Model
Now we'll train an ML model using the %%s2ml train cell magic. This uses SingleStore's ML Functions to train a classification model that predicts passenger survival.
Note: Training may take several minutes depending on the compute size selected. The model will learn patterns like "women and children first" and the impact of ticket class on survival.
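As an optional local sanity check, separate from ML Functions, you can fit a small scikit-learn model on the same kinds of features to get a feel for what accuracy is plausible before kicking off in-database training. The sketch below uses a tiny hand-made sample purely for illustration, not the real tables:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny illustrative sample (not the real Titanic data)
data = pd.DataFrame({
    'pclass':   [1, 3, 3, 2, 1, 3, 2, 3],
    'sex':      ['female', 'male', 'female', 'male',
                 'male', 'male', 'female', 'female'],
    'fare':     [80.0, 7.9, 8.1, 13.0, 52.0, 7.8, 26.0, 7.7],
    'survived': [1, 0, 1, 0, 1, 0, 1, 1],
})

# One-hot encode the categorical column, then fit a simple baseline
X = pd.get_dummies(data[['pclass', 'sex', 'fare']])
model = LogisticRegression().fit(X, data['survived'])
print(f"Baseline training accuracy: {model.score(X, data['survived']):.2f}")
```

If the in-database model's accuracy later lands far below a trivial baseline like this, that usually points at a data-loading or feature problem rather than the model itself.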
In [12]:
%%s2ml train as training_result
task: classification
model: titanic_survival_predictor
db: temp
input_table: titanic_training_data
target_column: survival_status
description: "Titanic passenger survival prediction based on demographics and ticket info"
runtime: cpu-small
selected_features: {"mode":"*","features":null}
Check Training Results
The training result is assigned to the variable training_result. Let's examine the training details.
In [13]:
# Display the training result
training_result
Monitor Training Status
Use the %s2ml status command to view the model details and training status. The status will be one of: Pre-processing, Training, Completed, or Error.
In [14]:
%s2ml status --model titanic_survival_predictor
Note
Wait for training to complete before proceeding to the next section.
You can re-run the cell above to check the status. Once the pipeline_status shows "Ready", you can proceed with predictions.
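If you'd rather not re-run the status cell by hand, a generic polling helper like the one below can wrap any status check. The %s2ml magic doesn't expose a documented Python API here, so check_status is a placeholder for however you retrieve the status string; the "Completed" and "Error" values follow the list above. This is a sketch, not an official interface:

```python
import time

def wait_for_status(check_status, ready_value="Completed",
                    interval_s=5.0, timeout_s=600.0):
    """Poll check_status() until it returns ready_value.

    Raises RuntimeError on an 'Error' status and TimeoutError on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = check_status()
        if status == ready_value:
            return status
        if status == "Error":
            raise RuntimeError("Training failed with status 'Error'")
        time.sleep(interval_s)
    raise TimeoutError(f"Training not {ready_value!r} after {timeout_s}s")

# Demo with a stub that reports 'Training' twice, then 'Completed'
_statuses = iter(["Training", "Training", "Completed"])
print(wait_for_status(lambda: next(_statuses), interval_s=0.01))  # Completed
```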
Run Sample Predictions
Once training is complete, let's run predictions on a few sample passengers from our test dataset to see how the model performs. Make sure the correct database is still selected.
In [15]:
%%sql
USE temp;
In [16]:
%%sql
SELECT
    cluster.ML_CLASSIFY('titanic_survival_predictor', TO_JSON(passenger.*)) as predicted_status,
    passenger.survival_status as actual_status,
    passenger.pclass as ticket_class,
    passenger.sex,
    passenger.age,
    passenger.fare
FROM (SELECT * FROM titanic_test_data LIMIT 10) AS passenger;
Run Predictions on Full Test Dataset
Now let's run predictions on the entire test dataset and store the results in our predictions table.
In [17]:
%%sql
INSERT INTO titanic_predictions (
    pclass, sex, age, sibsp, parch, fare, embarked,
    actual_status, predicted_status
)
SELECT
    passenger.pclass,
    passenger.sex,
    passenger.age,
    passenger.sibsp,
    passenger.parch,
    passenger.fare,
    passenger.embarked,
    passenger.survival_status as actual_status,
    cluster.ML_CLASSIFY('titanic_survival_predictor', TO_JSON(passenger.*)) as predicted_status
FROM titanic_test_data AS passenger;
Evaluate Model Performance
Let's analyze the prediction accuracy by comparing actual vs predicted survival status.
In [18]:
%%sql
SELECT
    COUNT(*) as total_predictions,
    SUM(CASE WHEN actual_status = JSON_EXTRACT_STRING(predicted_status, 'predicted_label') THEN 1 ELSE 0 END) as correct_predictions,
    ROUND(100.0 * SUM(CASE WHEN actual_status = JSON_EXTRACT_STRING(predicted_status, 'predicted_label') THEN 1 ELSE 0 END) / COUNT(*), 2) as accuracy_percentage
FROM titanic_predictions;
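The same accuracy check can also be done client-side. The sketch below mirrors the SQL logic in plain Python on a small hand-made sample; the shape of the predicted_status JSON (its predicted_label field) is taken from the queries in this notebook:

```python
import json

# Hypothetical sample of rows as they might come back from titanic_predictions;
# predicted_status is the JSON string returned by ML_CLASSIFY
rows = [
    {"actual_status": "Survived",
     "predicted_status": '{"predicted_label": "Survived", "confidence": 0.91}'},
    {"actual_status": "Died",
     "predicted_status": '{"predicted_label": "Died", "confidence": 0.84}'},
    {"actual_status": "Survived",
     "predicted_status": '{"predicted_label": "Died", "confidence": 0.55}'},
    {"actual_status": "Died",
     "predicted_status": '{"predicted_label": "Died", "confidence": 0.77}'},
]

# Count rows where the extracted label matches the actual status,
# mirroring the JSON_EXTRACT_STRING comparison in the SQL
correct = sum(
    1 for r in rows
    if json.loads(r["predicted_status"])["predicted_label"] == r["actual_status"]
)
accuracy_pct = round(100.0 * correct / len(rows), 2)
print(f"{correct}/{len(rows)} correct, accuracy = {accuracy_pct}%")  # 3/4 correct, accuracy = 75.0%
```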
Analyze Survival Factors
Let's examine how different passenger characteristics influenced survival predictions.
In [19]:
%%sql
-- Survival rate by sex
SELECT
    sex,
    COUNT(*) as total_passengers,
    SUM(CASE WHEN actual_status = 'Survived' THEN 1 ELSE 0 END) as actual_survivors,
    ROUND(100.0 * SUM(CASE WHEN actual_status = 'Survived' THEN 1 ELSE 0 END) / COUNT(*), 1) as survival_rate_pct
FROM titanic_predictions
GROUP BY sex
ORDER BY survival_rate_pct DESC;
In [20]:
%%sql
-- Survival rate by passenger class
SELECT
    pclass as ticket_class,
    COUNT(*) as total_passengers,
    SUM(CASE WHEN actual_status = 'Survived' THEN 1 ELSE 0 END) as actual_survivors,
    ROUND(100.0 * SUM(CASE WHEN actual_status = 'Survived' THEN 1 ELSE 0 END) / COUNT(*), 1) as survival_rate_pct,
    ROUND(AVG(fare), 2) as avg_fare_paid
FROM titanic_predictions
GROUP BY pclass
ORDER BY pclass;
Examine Misclassified Passengers
Let's look at passengers where the model made incorrect predictions to understand potential model limitations.
In [21]:
%%sql
SELECT
    actual_status,
    JSON_EXTRACT_STRING(predicted_status, 'predicted_label') as predicted_label,
    JSON_EXTRACT_DOUBLE(predicted_status, 'confidence') as confidence,
    pclass as ticket_class,
    sex,
    age,
    sibsp as siblings_spouses,
    parch as parents_children,
    fare,
    embarked
FROM titanic_predictions
WHERE actual_status != JSON_EXTRACT_STRING(predicted_status, 'predicted_label')
LIMIT 15;
Cleanup
In [22]:
%%sql
DROP TABLE IF EXISTS titanic_training_data;
DROP TABLE IF EXISTS titanic_test_data;
DROP TABLE IF EXISTS titanic_predictions;
DROP DATABASE IF EXISTS temp;

Details
About this Template
Learn how to train an ML Classify model and run it to predict the class of an input row.
This Notebook can be run in Standard and Enterprise deployments.
License
This Notebook has been released under the Apache 2.0 open source license.