Mongo Atlas & SingleStore Kai

Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace, navigate to Start using the left nav. You can also use an existing Standard or Premium workspace with this notebook.

No code change required! 100% MongoDB notebook!

Given the small dataset, the objective of this notebook is not to compare performance.

What you will learn in this notebook:

  1. Install libraries and import modules

  2. Connect to a MongoDB Atlas and SingleStore Kai endpoints

  3. Copy Atlas collections into SingleStore: the synthetic collections contain retail sales transactions with customer information

  4. Compare performance of the same code on queries ranging from simple to more complex:

  1. Document counts

  2. Product quantity sold

  3. Average customer satisfaction

  4. Average satisfaction per product

  5. Number of transactions by Location and membership

  6. Top 10 product sales

1. Install libraries and import modules

Make sure that you have created a MongoDB-enabled workspace.

This must be done when creating the workspace (for Standard/Premium workspaces). For Starter Workspaces, the Kai API is enabled by default.

In [1]:

!pip install pymongo pandas matplotlib plotly ipywidgets --quiet

In [2]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if shared_tier_check and shared_tier_check[0][1] == 'ON':
    current_database = %sql SELECT DATABASE() as CurrentDatabase
    database_to_use = current_database[0][0]
else:
    database_to_use = "new_transactions"
    %sql CREATE DATABASE {{database_to_use}}

Action Required

Make sure to select a database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
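As a quick sanity check, you can confirm which database the kernel is pointed at using the same %sql magic as the cell above (a hedged convenience, not part of the original flow):

current_database = %sql SELECT DATABASE() AS CurrentDatabase
print(current_database)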

In [3]:

import os
import time
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.subplots as sp
import pymongo
from pymongo import MongoClient
from plotly.offline import plot, iplot, init_notebook_mode

2. Connect to Atlas and SingleStore Kai endpoints

We are using a shared-tier Atlas cluster on the backend.

In [4]:

# No need to edit anything
myclientmongodb = pymongo.MongoClient("mongodb+srv://mongo_sample_reader:SingleStoreRocks27017@cluster1.tfutgo0.mongodb.net/?retryWrites=true&w=majority")
mydbmongodb = myclientmongodb["new_transactions"]
mongoitems = mydbmongodb["items"]
mongocusts = mydbmongodb["custs"]
mongotxs = mydbmongodb["txs"]

Select the workspace that you want to use.

In [5]:

# connection_url_kai is injected by the SingleStore notebook environment for
# the selected Kai-enabled workspace.
s2clientmongodb = pymongo.MongoClient(connection_url_kai)
s2dbmongodb = s2clientmongodb[database_to_use]
s2mongoitems = s2dbmongodb["items"]
s2mongocusts = s2dbmongodb["custs"]
s2mongotxs = s2dbmongodb["txs"]
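Before copying any data, both endpoints can be verified with pymongo's standard ping admin command. This is a minimal sketch, assuming the clients created above:

# Raises an exception if either server is unreachable.
for label, client in [("Atlas", myclientmongodb), ("SingleStore Kai", s2clientmongodb)]:
    client.admin.command("ping")
    print(f"{label}: connection OK")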

3. Copy Atlas collections into SingleStore Kai

In [6]:

mongocollections = [mongoitems, mongocusts, mongotxs]
for mongo_collection in mongocollections:
    df = pd.DataFrame(list(mongo_collection.find())).reset_index(drop=True)
    data_dict = df.to_dict(orient='records')
    s2mongo_collection = s2dbmongodb[mongo_collection.name]
    s2mongo_collection.insert_many(data_dict)
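Note that insert_many appends documents, so re-running the cell above will duplicate the copies. A sketch of an idempotent variant, assuming you want each run to start from a clean target (drop() is a standard pymongo call):

for mongo_collection in mongocollections:
    s2_target = s2dbmongodb[mongo_collection.name]
    s2_target.drop()  # clear any previous copy so the run is repeatable
    docs = list(mongo_collection.find())
    if docs:
        s2_target.insert_many(docs)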

Count documents in SingleStore

In [7]:

mg_count = s2mongoitems.count_documents({})
mg_count
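As a further check, the per-collection counts on the two sides can be compared directly (a hedged sketch, assuming the databases connected above):

# Compare document counts per collection between Atlas and SingleStore Kai.
for name in ["items", "custs", "txs"]:
    atlas_count = mydbmongodb[name].count_documents({})
    s2_count = s2dbmongodb[name].count_documents({})
    print(f"{name}: atlas={atlas_count}, singlestore={s2_count}, match={atlas_count == s2_count}")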

Compare Queries and Performance

In-app analytics is everywhere.
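Each comparison below repeats the same timing loop against both endpoints. A small helper like the hypothetical time_runs below captures that pattern once; the cells that follow keep their explicit loops, so this is just a sketch:

def time_runs(fn, n=10):
    """Run fn() n times and return the wall-clock duration of each run in seconds."""
    times = []
    for _ in range(n):
        start = time.time()
        fn()
        times.append(time.time() - start)
    return times

# Example usage (hypothetical):
# mongo_times = time_runs(lambda: mongoitems.count_documents({}))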

4. Document counts

In [8]:

num_iterations = 10
mongo_times = []
for i in range(num_iterations):
    mg_start_time = time.time()
    mg_count = mongoitems.count_documents({})
    mg_stop_time = time.time()
    mongo_times.append(mg_stop_time - mg_start_time)

s2_times = []
for i in range(num_iterations):
    s2_start_time = time.time()
    s2_count = s2mongoitems.count_documents({})
    s2_stop_time = time.time()
    s2_times.append(s2_stop_time - s2_start_time)

df = pd.DataFrame.from_dict({
    'iteration': list(range(1, num_iterations + 1)),
    'mongo_times': mongo_times,
    's2_times': s2_times,
})
df_2 = pd.DataFrame.from_dict({
    'counts': [mg_count, s2_count],
    'connection_type': ["mongodb", "singlestore"],
})
figures = [
    px.line(df.iloc[1:], x='iteration', y=['mongo_times', 's2_times']),
    px.bar(df_2, x="connection_type", y="counts", color="connection_type")
]
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=["Document Count Execution Time", "Document Counts"])
for i, figure in enumerate(figures):
    for trace in range(len(figure["data"])):
        fig.append_trace(figure["data"][trace], row=1, col=i + 1)
fig.update_yaxes(title_text="Time in Seconds", row=1, col=1)
fig.update_xaxes(title_text="Iteration", row=1, col=1)
fig.update_xaxes(tickmode='array', tickvals=list(range(1, num_iterations + 1)), row=1, col=1)
fig
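To reduce the ten samples to headline numbers, the df built above can be summarized directly; like the line chart, this hedged follow-up skips the first (warm-up) iteration:

# Mean and max latency per endpoint, excluding iteration 1.
print(df.iloc[1:][['mongo_times', 's2_times']].agg(['mean', 'max']))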

5. Product Quantity Sold

In [9]:

num_iterations = 10
mongo_times = []
pipeline = [
    {"$group": {"_id": "$item.name", "totalQuantity": {"$sum": "$item.quantity"}}},
    {"$sort": {"totalQuantity": -1}},
    {"$limit": 5},
]
for i in range(num_iterations):
    mg_start_time = time.time()
    mg_result = mongoitems.aggregate(pipeline)
    mg_stop_time = time.time()
    mongo_times.append(mg_stop_time - mg_start_time)

s2_times = []
for i in range(num_iterations):
    s2_start_time = time.time()
    s2_result = s2mongoitems.aggregate(pipeline)
    s2_stop_time = time.time()
    s2_times.append(s2_stop_time - s2_start_time)

x_axis = list(range(1, num_iterations + 1))
data = {
    'iteration': x_axis,
    'mongo_times': mongo_times,
    's2_times': s2_times,
}
df = pd.DataFrame.from_dict(data)
item_names = []
item_quantity = []
for i in mg_result:
    item_names.append(i["_id"])
    item_quantity.append(i["totalQuantity"])
figures = [
    px.line(df.iloc[1:], x='iteration', y=['mongo_times', 's2_times']),
    px.bar(x=item_names, y=item_quantity)
]
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=["Execution Time", "Comparison of Product Quantity Sold"])
for i, figure in enumerate(figures):
    for trace in range(len(figure["data"])):
        fig.append_trace(figure["data"][trace], row=1, col=i + 1)
fig.update_yaxes(title_text="Time in Seconds", row=1, col=1)
fig.update_xaxes(title_text="Iteration", row=1, col=1)
fig.update_xaxes(tickmode='array', tickvals=list(range(1, num_iterations + 1)), row=1, col=1)
fig

6. Average Customer Satisfaction

In [10]:

num_iterations = 10
mongo_times = []
pipeline = [
    {'$group':
        {
            '_id': '$customer.email',
            'average_satisfaction': {'$avg': '$customer.satisfaction'},
        },
    },
    {'$limit': 10},
]
for i in range(num_iterations):
    mg_start_time = time.time()
    mg_result = mongotxs.aggregate(pipeline)
    mg_stop_time = time.time()
    mongo_times.append(mg_stop_time - mg_start_time)

s2_times = []
for i in range(num_iterations):
    s2_start_time = time.time()
    s2_result = s2mongotxs.aggregate(pipeline)
    s2_stop_time = time.time()
    s2_times.append(s2_stop_time - s2_start_time)

x_axis = list(range(1, num_iterations + 1))
data = {
    'iteration': x_axis,
    'mongo_times': mongo_times,
    's2_times': s2_times,
}
df = pd.DataFrame.from_dict(data)
item_names = []
item_quantity = []
for i in mg_result:
    item_names.append(i["_id"])
    item_quantity.append(i["average_satisfaction"])
figures = [
    px.line(df.iloc[1:], x='iteration', y=['mongo_times', 's2_times']),
    px.bar(x=item_names, y=item_quantity)
]
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=["Execution Time", "Average Customer Satisfaction"])
for i, figure in enumerate(figures):
    for trace in range(len(figure["data"])):
        fig.append_trace(figure["data"][trace], row=1, col=i + 1)
fig.update_yaxes(title_text="Time in Seconds", row=1, col=1)
fig.update_xaxes(title_text="Iteration", row=1, col=1)
fig.update_xaxes(tickmode='array', tickvals=list(range(1, num_iterations + 1)), row=1, col=1)
fig

7. Average Satisfaction per Product

In [11]:

num_iterations = 10
mongo_times = []
pipeline = [
    {"$lookup":
        {
            "from": "txs",
            "localField": "tx_id",
            "foreignField": "transaction_id",
            "as": "transaction_links",
        }
    },
    {"$limit": 10},
    {"$unwind": "$transaction_links"},
    {"$group":
        {
            "_id": {"item": "$item.name"},
            "Average Satisfaction": {"$avg": "$transaction_links.customer.satisfaction"}
        }
    }
]
for i in range(num_iterations):
    mg_start_time = time.time()
    mg_result = mongoitems.aggregate(pipeline)
    mg_stop_time = time.time()
    mongo_times.append(mg_stop_time - mg_start_time)

s2_times = []
for i in range(num_iterations):
    s2_start_time = time.time()
    s2_result = s2mongoitems.aggregate(pipeline)
    s2_stop_time = time.time()
    s2_times.append(s2_stop_time - s2_start_time)

x_axis = list(range(1, num_iterations + 1))
data = {
    'iteration': x_axis,
    'mongo_times': mongo_times,
    's2_times': s2_times,
}
df = pd.DataFrame.from_dict(data)
item_names = []
item_quantity = []
for i in mg_result:
    item_names.append(i["_id"]['item'])
    item_quantity.append(i["Average Satisfaction"])
figures = [
    px.line(df.iloc[1:], x='iteration', y=['mongo_times', 's2_times']),
    px.bar(x=item_names, y=item_quantity)
]
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=["Execution Time", "Average Satisfaction per Product"])
for i, figure in enumerate(figures):
    for trace in range(len(figure["data"])):
        fig.append_trace(figure["data"][trace], row=1, col=i + 1)
fig.update_yaxes(title_text="Time in Seconds", row=1, col=1)
fig.update_xaxes(title_text="Iteration", row=1, col=1)
fig.update_xaxes(tickmode='array', tickvals=list(range(1, num_iterations + 1)), row=1, col=1)
fig

8. Number of transactions by location and membership

In [12]:

num_iterations = 10
mongo_times = []
pipeline = [
    {"$lookup":
        {
            "from": "custs",
            "localField": "customer.email",
            "foreignField": "email",
            "as": "transaction_links",
        }
    },
    {"$limit": 100},
    {"$group":
        {
            "_id": {
                "location": "$store_location",
                "membership": "$transaction_links.membership"
            },
            "count": {"$sum": 1}
        }
    },
    {"$sort": {"count": -1}}
]
for i in range(num_iterations):
    mg_start_time = time.time()
    mg_result = mongotxs.aggregate(pipeline)
    mg_stop_time = time.time()
    mongo_times.append(mg_stop_time - mg_start_time)

s2_times = []
for i in range(num_iterations):
    s2_start_time = time.time()
    s2_result = s2mongotxs.aggregate(pipeline)
    s2_stop_time = time.time()
    s2_times.append(s2_stop_time - s2_start_time)

x_axis = list(range(1, num_iterations + 1))
data = {
    'iteration': x_axis,
    'mongo_times': mongo_times,
    's2_times': s2_times,
}
df = pd.DataFrame.from_dict(data)
item_names = []
item_quantity = []
for i in mg_result:
    toadd = i["_id"]['location'] + ', ' + i["_id"]['membership'][0]
    item_names.append(toadd)
    item_quantity.append(i['count'])
figures = [
    px.line(df.iloc[1:], x='iteration', y=['mongo_times', 's2_times']),
    px.bar(x=item_names, y=item_quantity)
]
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=["Execution Time", "Sales per Store"])
for i, figure in enumerate(figures):
    for trace in range(len(figure["data"])):
        fig.append_trace(figure["data"][trace], row=1, col=i + 1)
fig.update_yaxes(title_text="Time in Seconds", row=1, col=1)
fig.update_xaxes(title_text="Iteration", row=1, col=1)
fig.update_xaxes(tickmode='array', tickvals=list(range(1, num_iterations + 1)), row=1, col=1)
fig

9. Top 10 Product Sales

In [13]:

num_iterations = 10
mongo_times = []
pipeline = [
    {"$project":
        {
            "item": 1,
            "revenue": {"$multiply": ["$item.price", "$item.quantity"]}
        }
    },
    {"$group":
        {
            "_id": "$item.name",
            "total_revenue": {"$sum": "$revenue"}
        }
    },
    {"$sort": {"total_revenue": -1}},
    {"$limit": 10},
]
for i in range(num_iterations):
    mg_start_time = time.time()
    mg_result = mongoitems.aggregate(pipeline)
    mg_stop_time = time.time()
    mongo_times.append(mg_stop_time - mg_start_time)

s2_times = []
for i in range(num_iterations):
    s2_start_time = time.time()
    s2_result = s2mongoitems.aggregate(pipeline)
    s2_stop_time = time.time()
    s2_times.append(s2_stop_time - s2_start_time)

x_axis = list(range(1, num_iterations + 1))
data = {
    'iteration': x_axis,
    'mongo_times': mongo_times,
    's2_times': s2_times,
}
df = pd.DataFrame.from_dict(data)
item_names = []
item_quantity = []
for result in mg_result:  # keep all ten products returned by the pipeline
    item_names.append(result["_id"])
    item_quantity.append(result['total_revenue'])
figures = [
    px.line(df.iloc[1:], x='iteration', y=['mongo_times', 's2_times']),  # exclude the first (warm-up) iteration from the line chart
    px.bar(x=item_names, y=item_quantity)
]
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=["Execution Time", "Top 10 Product Sales"])
for i, figure in enumerate(figures):
    for trace in range(len(figure["data"])):
        fig.append_trace(figure["data"][trace], row=1, col=i + 1)
fig.update_yaxes(title_text="Time in Seconds", row=1, col=1)
fig.update_xaxes(title_text="Iteration", row=1, col=1)
fig.update_xaxes(tickmode='array', tickvals=list(range(1, num_iterations + 1)), row=1, col=1)
fig

Clean up

Action Required

If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.

In [14]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS new_transactions;

Details

Tags

#starter #mongo #kai

License

This Notebook has been released under the Apache 2.0 open source license.