Demonstrate ML function Classify
Notebook
Note
You can use your existing Standard or Premium workspace with this Notebook.
This feature is currently in Private Preview. Please reach out to support@singlestore.com to confirm whether this feature can be enabled in your org.
This Jupyter notebook will help you:
Load the titanic dataset
Store the data in a SingleStore table
Use ML Functions for training and predictions
Run some common Data Analysis tasks
Prerequisites: Ensure ML Functions are installed on your deployment (AI > AI & ML Functions).
In [1]:
%%sql
-- Ensure that ML_CLASSIFY is listed in the Functions_in_cluster column
SHOW FUNCTIONS IN CLUSTER;
In [2]:
!pip install -q httplib2 seaborn pandas numpy scikit-learn
Load and Prepare the Titanic Dataset
We'll use the famous Titanic dataset from seaborn, which contains passenger information from the RMS Titanic. The goal is to predict whether a passenger survived based on features like age, sex, ticket class, and fare.
In [3]:
%%sql
CREATE DATABASE IF NOT EXISTS temp;
USE temp;
In [4]:
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load the Titanic dataset
titanic_df = sns.load_dataset('titanic')

# Display basic information
print(f"Dataset shape: {titanic_df.shape}")
print(f"\nColumn names: {list(titanic_df.columns)}")
print("\nFirst 5 rows:")
print(titanic_df.head())

# Check survival distribution
print("\nSurvival Distribution:")
print(titanic_df['survived'].value_counts())
Clean and Prepare Features
We'll select the most important features and handle missing values to create a clean dataset for training.
In [5]:
# Select relevant columns for prediction
columns_to_use = ['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']
titanic_clean = titanic_df[columns_to_use].copy()

# Fill missing values
titanic_clean['age'] = titanic_clean['age'].fillna(titanic_clean['age'].median())
titanic_clean['fare'] = titanic_clean['fare'].fillna(titanic_clean['fare'].median())
titanic_clean['embarked'] = titanic_clean['embarked'].fillna('S')  # Most common port

# Drop any remaining rows with missing values
titanic_clean = titanic_clean.dropna()

# Convert survived to text labels for classification
titanic_clean['survival_status'] = titanic_clean['survived'].map({
    0: 'Died',
    1: 'Survived'
})

# Drop the original numeric survived column
titanic_clean = titanic_clean.drop('survived', axis=1)

print(f"Clean dataset shape: {titanic_clean.shape}")
print("\nMissing values per column:")
print(titanic_clean.isnull().sum())
print("\nSurvival status distribution:")
print(titanic_clean['survival_status'].value_counts())
print("\nFirst 5 rows of clean data:")
print(titanic_clean.head())
Split Data into Training and Test Sets
We'll split the data into 80% training and 20% test sets to evaluate model performance.
In [6]:
# Split into train (80%) and test (20%) sets
train_df, test_df = train_test_split(
    titanic_clean,
    test_size=0.2,
    random_state=42,
    stratify=titanic_clean['survival_status']
)

print(f"Training set size: {len(train_df)} passengers")
print(f"Test set size: {len(test_df)} passengers")
print("\nTraining set survival distribution:")
print(train_df['survival_status'].value_counts())
print("\nTest set survival distribution:")
print(test_df['survival_status'].value_counts())
In [7]:
%%sql
DROP TABLE IF EXISTS titanic_training_data;
DROP TABLE IF EXISTS titanic_test_data;
DROP TABLE IF EXISTS titanic_predictions;

CREATE TABLE titanic_training_data (
    pclass INT,
    sex VARCHAR(10),
    age FLOAT,
    sibsp INT,
    parch INT,
    fare FLOAT,
    embarked VARCHAR(1),
    survival_status VARCHAR(10)
);

CREATE TABLE titanic_test_data (
    pclass INT,
    sex VARCHAR(10),
    age FLOAT,
    sibsp INT,
    parch INT,
    fare FLOAT,
    embarked VARCHAR(1),
    survival_status VARCHAR(10)
);

CREATE TABLE titanic_predictions (
    pclass INT,
    sex VARCHAR(10),
    age FLOAT,
    sibsp INT,
    parch INT,
    fare FLOAT,
    embarked VARCHAR(1),
    actual_status VARCHAR(10),
    predicted_status JSON
);
Load Data into SingleStore Tables
We'll use pandas to insert the training and test data into our SingleStore tables.
In [8]:
import singlestoredb as s2

# Create engine with database specified
engine = s2.create_engine(database='temp')

# Insert training data
train_df.to_sql(
    'titanic_training_data',
    con=engine,
    if_exists='append',
    index=False,
    method='multi'
)

# Insert test data
test_df.to_sql(
    'titanic_test_data',
    con=engine,
    if_exists='append',
    index=False,
    method='multi'
)

print(f"Inserted {len(train_df)} rows into titanic_training_data")
print(f"Inserted {len(test_df)} rows into titanic_test_data")
Verify Data Load
Let's verify that our data was loaded correctly and review the passenger demographics.
In [9]:
%%sql
SELECT COUNT(*) as training_count FROM titanic_training_data;
In [10]:
%%sql
SELECT COUNT(*) as test_count FROM titanic_test_data;
In [11]:
%%sql
SELECT
    survival_status,
    COUNT(*) as passenger_count,
    ROUND(AVG(age), 1) as avg_age,
    ROUND(AVG(fare), 2) as avg_fare
FROM titanic_training_data
GROUP BY survival_status;
SELECT * FROM titanic_training_data LIMIT 5;
Train the ML Classification Model
Now we'll train an ML model using the %%s2ml train cell magic. This uses SingleStore's ML Functions to train a classification model that predicts passenger survival.
Note: Training may take several minutes depending on the compute size selected. The model will learn patterns like "women and children first" and the impact of ticket class on survival.
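As an optional local sanity check, separate from ML Functions, you can fit a small scikit-learn model on the same kinds of features to get a feel for what accuracy is plausible before kicking off in-database training. The sketch below uses a tiny hand-made sample purely for illustration, not the real tables:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny illustrative sample (not the real Titanic data)
data = pd.DataFrame({
    'pclass':   [1, 3, 3, 2, 1, 3, 2, 3],
    'sex':      ['female', 'male', 'female', 'male',
                 'male', 'male', 'female', 'female'],
    'fare':     [80.0, 7.9, 8.1, 13.0, 52.0, 7.8, 26.0, 7.7],
    'survived': [1, 0, 1, 0, 1, 0, 1, 1],
})

# One-hot encode the categorical column, then fit a simple baseline
X = pd.get_dummies(data[['pclass', 'sex', 'fare']])
model = LogisticRegression().fit(X, data['survived'])
print(f"Baseline training accuracy: {model.score(X, data['survived']):.2f}")
```

If the in-database model's accuracy later lands far below a trivial baseline like this, that usually points at a data-loading or feature problem rather than the model itself.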
In [12]:
%%s2ml train as training_result
task: classification
model: titanic_survival_predictor
db: temp
input_table: titanic_training_data
target_column: survival_status
description: "Titanic passenger survival prediction based on demographics and ticket info"
runtime: cpu-small
selected_features: {"mode":"*","features":null}
Check Training Results
The training result is assigned to the variable training_result. Let's examine the training details.
In [13]:
# Display the training result
training_result
Monitor Training Status
Use the %s2ml status command to view the model details and training status. The status will be one of: Pre-processing, Training, Completed, or Error.
In [14]:
%s2ml status --model titanic_survival_predictor
Note
Wait for training to complete before proceeding to the next section.
You can re-run the cell above to check the status. Once the pipeline_status shows "Ready", you can proceed with predictions.
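If you'd rather not re-run the status cell by hand, a generic polling helper like the one below can wrap any status check. The %s2ml magic doesn't expose a documented Python API here, so check_status is a placeholder for however you retrieve the status string; the "Completed" and "Error" values follow the list above. This is a sketch, not an official interface:

```python
import time

def wait_for_status(check_status, ready_value="Completed",
                    interval_s=5.0, timeout_s=600.0):
    """Poll check_status() until it returns ready_value.

    Raises RuntimeError on an 'Error' status and TimeoutError on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = check_status()
        if status == ready_value:
            return status
        if status == "Error":
            raise RuntimeError("Training failed with status 'Error'")
        time.sleep(interval_s)
    raise TimeoutError(f"Training not {ready_value!r} after {timeout_s}s")

# Demo with a stub that reports 'Training' twice, then 'Completed'
_statuses = iter(["Training", "Training", "Completed"])
print(wait_for_status(lambda: next(_statuses), interval_s=0.01))  # Completed
```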
Run Sample Predictions
Once training is complete, let's run predictions on a few sample passengers from our test dataset to see how the model performs. Make sure the correct database is still selected.
In [15]:
%%sql
USE temp;
In [16]:
%%sql
SELECT
    cluster.ML_CLASSIFY('titanic_survival_predictor', TO_JSON(passenger.*)) as predicted_status,
    passenger.survival_status as actual_status,
    passenger.pclass as ticket_class,
    passenger.sex,
    passenger.age,
    passenger.fare
FROM (SELECT * FROM titanic_test_data LIMIT 10) AS passenger;
Run Predictions on Full Test Dataset
Now let's run predictions on the entire test dataset and store the results in our predictions table.
In [17]:
%%sql
INSERT INTO titanic_predictions (
    pclass, sex, age, sibsp, parch, fare, embarked,
    actual_status, predicted_status
)
SELECT
    passenger.pclass,
    passenger.sex,
    passenger.age,
    passenger.sibsp,
    passenger.parch,
    passenger.fare,
    passenger.embarked,
    passenger.survival_status as actual_status,
    cluster.ML_CLASSIFY('titanic_survival_predictor', TO_JSON(passenger.*)) as predicted_status
FROM titanic_test_data AS passenger;
Evaluate Model Performance
Let's analyze the prediction accuracy by comparing actual vs predicted survival status.
In [18]:
%%sql
SELECT
    COUNT(*) as total_predictions,
    SUM(CASE WHEN actual_status = JSON_EXTRACT_STRING(predicted_status, 'predicted_label') THEN 1 ELSE 0 END) as correct_predictions,
    ROUND(100.0 * SUM(CASE WHEN actual_status = JSON_EXTRACT_STRING(predicted_status, 'predicted_label') THEN 1 ELSE 0 END) / COUNT(*), 2) as accuracy_percentage
FROM titanic_predictions;
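The same accuracy check can also be done client-side. The sketch below mirrors the SQL logic in plain Python on a small hand-made sample; the shape of the predicted_status JSON (its predicted_label field) is taken from the queries in this notebook:

```python
import json

# Hypothetical sample of rows as they might come back from titanic_predictions;
# predicted_status is the JSON string returned by ML_CLASSIFY
rows = [
    {"actual_status": "Survived",
     "predicted_status": '{"predicted_label": "Survived", "confidence": 0.91}'},
    {"actual_status": "Died",
     "predicted_status": '{"predicted_label": "Died", "confidence": 0.84}'},
    {"actual_status": "Survived",
     "predicted_status": '{"predicted_label": "Died", "confidence": 0.55}'},
    {"actual_status": "Died",
     "predicted_status": '{"predicted_label": "Died", "confidence": 0.77}'},
]

# Count rows where the extracted label matches the actual status,
# mirroring the JSON_EXTRACT_STRING comparison in the SQL
correct = sum(
    1 for r in rows
    if json.loads(r["predicted_status"])["predicted_label"] == r["actual_status"]
)
accuracy_pct = round(100.0 * correct / len(rows), 2)
print(f"{correct}/{len(rows)} correct, accuracy = {accuracy_pct}%")  # 3/4 correct, accuracy = 75.0%
```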
Analyze Survival Factors
Let's examine how different passenger characteristics influenced survival predictions.
In [19]:
%%sql
-- Survival rate by sex
SELECT
    sex,
    COUNT(*) as total_passengers,
    SUM(CASE WHEN actual_status = 'Survived' THEN 1 ELSE 0 END) as actual_survivors,
    ROUND(100.0 * SUM(CASE WHEN actual_status = 'Survived' THEN 1 ELSE 0 END) / COUNT(*), 1) as survival_rate_pct
FROM titanic_predictions
GROUP BY sex
ORDER BY survival_rate_pct DESC;
In [20]:
%%sql
-- Survival rate by passenger class
SELECT
    pclass as ticket_class,
    COUNT(*) as total_passengers,
    SUM(CASE WHEN actual_status = 'Survived' THEN 1 ELSE 0 END) as actual_survivors,
    ROUND(100.0 * SUM(CASE WHEN actual_status = 'Survived' THEN 1 ELSE 0 END) / COUNT(*), 1) as survival_rate_pct,
    ROUND(AVG(fare), 2) as avg_fare_paid
FROM titanic_predictions
GROUP BY pclass
ORDER BY pclass;
Examine Misclassified Passengers
Let's look at passengers where the model made incorrect predictions to understand potential model limitations.
In [21]:
%%sql
SELECT
    actual_status,
    JSON_EXTRACT_STRING(predicted_status, 'predicted_label') as predicted_label,
    JSON_EXTRACT_DOUBLE(predicted_status, 'confidence') as confidence,
    pclass as ticket_class,
    sex,
    age,
    sibsp as siblings_spouses,
    parch as parents_children,
    fare,
    embarked
FROM titanic_predictions
WHERE actual_status != JSON_EXTRACT_STRING(predicted_status, 'predicted_label')
LIMIT 15;
Cleanup
In [22]:
%%sql
DROP TABLE IF EXISTS titanic_training_data;
DROP TABLE IF EXISTS titanic_test_data;
DROP TABLE IF EXISTS titanic_predictions;
DROP DATABASE IF EXISTS temp;

Details
About this Template
Learn how to train an ML Classify model and run it to predict the class of an input row.
This Notebook can be run in Standard and Enterprise deployments.
License
This Notebook has been released under the Apache 2.0 open source license.