
Learn how to Optimize Performance with TPCH 100

Note
This tutorial is meant for Standard and Premium workspaces. You can't run it on a Free Starter Workspace due to its storage restrictions. Create a workspace using +group in the left nav and select Standard for this notebook. Gallery notebooks tagged with "Starter" are suitable to run on a Free Starter Workspace.
Context
This notebook walks through four key principles for getting performance out of SingleStoreDB, using the TPC-H benchmark. SingleStoreDB is a distributed database, so you should think about shard keys, database partitions, primary keys, and indexes to get the best performance out of it.
About database partitions
The general recommendation for most clusters is to have 4 CPU cores per database partition on each leaf. This means that if you had a cluster with 16 cores on each of 4 leaves (64 CPU cores total across all leaf hosts), you would want 4 partitions on each leaf (16 partitions across the cluster). If you are using an S00 workspace, you will have 2 partitions per database. Note that increasing the number of partitions adds memory and caching overhead, which can be expensive if you have thousands of tables.
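As a quick sanity check, you can confirm how many partitions a database was created with. The sketch below uses a hypothetical database name; the PARTITIONS clause and the information_schema view are the same ones used later in this notebook.
%%sql
# Hypothetical example: pin the partition count at creation time (it cannot be changed afterwards)
CREATE DATABASE IF NOT EXISTS my_database PARTITIONS = 16;

SELECT database_name, num_partitions
FROM information_schema.DISTRIBUTED_DATABASES
WHERE database_name = 'my_database';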
About shard keys
Data is distributed across the SingleStoreDB Cloud workspace into a number of partitions on the leaf nodes. The shard key is a collection of columns in a table that controls how the rows of that table are distributed. To determine the partition responsible for a given row, SingleStoreDB Cloud computes a hash of the shard key columns and maps it to a partition ID. Therefore, rows with the same shard key will reside on the same partition.
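To make this concrete, here is a minimal sketch (with a hypothetical table name) of how a shard key is declared. Rows sharing the same l_orderkey hash to the same partition, so joins and filters on that column can be resolved locally on each partition.
%%sql
# Hypothetical illustration; the optimized TPC-H tables later in this notebook follow the same pattern
CREATE TABLE IF NOT EXISTS lineitem_example (
    l_orderkey bigint NOT NULL,
    l_linenumber int NOT NULL,
    l_quantity decimal(15,2) NOT NULL,
    SHARD KEY (l_orderkey)
);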
About hash indexes
Hash indexes are highly efficient for exact-match lookups (point reads). Because a hash index stores rows in a sparse array of buckets indexed through a hash function on the relevant columns, queries can quickly retrieve data by examining only the corresponding bucket rather than searching the entire dataset. This significantly reduces lookup time and therefore improves performance for those query types.
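As a sketch of the pattern (hypothetical table and value), a hash index on the lookup column lets an exact-match filter probe a single bucket instead of scanning the whole table:
%%sql
# Hypothetical illustration of a hash index used for point reads
CREATE TABLE IF NOT EXISTS orders_example (
    o_orderkey bigint NOT NULL,
    o_totalprice decimal(15,2) NOT NULL,
    SHARD KEY (o_orderkey),
    KEY (o_orderkey) USING HASH
);

SELECT o_totalprice FROM orders_example WHERE o_orderkey = 12345;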
Note
For this tutorial, we recommend using a workspace of size S4 to ingest the data faster and to better see the gains you can get from a distributed architecture.
Let's first create the unoptimized database
In [1]:
%%sql
CREATE DATABASE IF NOT EXISTS s2_tpch_unoptimized

# To create a database with custom partitions, use the following syntax: CREATE DATABASE YourDatabaseName PARTITIONS=X;
# You cannot change the number of partitions after creation
If you are using an S00 workspace, the database will have 2 partitions; if you are using an S1, it will have 8 partitions.
In [2]:
%%sql
SELECT num_partitions FROM information_schema.DISTRIBUTED_DATABASES WHERE database_name = 's2_tpch_unoptimized';
Let's create all the tables in that database with no indexes, shard keys, or primary keys
Action Required
Make sure to select the s2_tpch_unoptimized database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
In [3]:
%%sql
CREATE TABLE IF NOT EXISTS `customer` (
    `c_custkey` int(11) NOT NULL,
    `c_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `c_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `c_nationkey` int(11) NOT NULL,
    `c_phone` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `c_acctbal` decimal(15,2) NOT NULL,
    `c_mktsegment` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `c_comment` varchar(117) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
);

CREATE TABLE IF NOT EXISTS `lineitem` (
    `l_orderkey` bigint(11) NOT NULL,
    `l_partkey` int(11) NOT NULL,
    `l_suppkey` int(11) NOT NULL,
    `l_linenumber` int(11) NOT NULL,
    `l_quantity` decimal(15,2) NOT NULL,
    `l_extendedprice` decimal(15,2) NOT NULL,
    `l_discount` decimal(15,2) NOT NULL,
    `l_tax` decimal(15,2) NOT NULL,
    `l_returnflag` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `l_linestatus` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `l_shipdate` date NOT NULL,
    `l_commitdate` date NOT NULL,
    `l_receiptdate` date NOT NULL,
    `l_shipinstruct` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `l_shipmode` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `l_comment` varchar(44) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
);

CREATE TABLE IF NOT EXISTS `nation` (
    `n_nationkey` int(11) NOT NULL,
    `n_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `n_regionkey` int(11) NOT NULL,
    `n_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
);

CREATE TABLE IF NOT EXISTS `orders` (
    `o_orderkey` bigint(11) NOT NULL,
    `o_custkey` int(11) NOT NULL,
    `o_orderstatus` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `o_totalprice` decimal(15,2) NOT NULL,
    `o_orderdate` date NOT NULL,
    `o_orderpriority` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `o_clerk` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `o_shippriority` int(11) NOT NULL,
    `o_comment` varchar(79) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
);

CREATE TABLE IF NOT EXISTS `part` (
    `p_partkey` int(11) NOT NULL,
    `p_name` varchar(55) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `p_mfgr` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `p_brand` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `p_type` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `p_size` int(11) NOT NULL,
    `p_container` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `p_retailprice` decimal(15,2) NOT NULL,
    `p_comment` varchar(23) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
);

CREATE TABLE IF NOT EXISTS `partsupp` (
    `ps_partkey` int(11) NOT NULL,
    `ps_suppkey` int(11) NOT NULL,
    `ps_availqty` int(11) NOT NULL,
    `ps_supplycost` decimal(15,2) NOT NULL,
    `ps_comment` varchar(199) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
);

CREATE TABLE IF NOT EXISTS `region` (
    `r_regionkey` int(11) NOT NULL,
    `r_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `r_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
);

CREATE TABLE IF NOT EXISTS `supplier` (
    `s_suppkey` int(11) NOT NULL,
    `s_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `s_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `s_nationkey` int(11) NOT NULL,
    `s_phone` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `s_acctbal` decimal(15,2) NOT NULL,
    `s_comment` varchar(101) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
);
Now let's create the pipelines and run them to ingest data
In [4]:
%%sql
CREATE PIPELINE IF NOT EXISTS `customer_pipeline`
AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/customer'
CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
BATCH_INTERVAL 2500
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE `customer`
FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\'
LINES TERMINATED BY '|\n' STARTING BY '';
In [5]:
%%sql
CREATE PIPELINE IF NOT EXISTS `lineitem_pipeline`
AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/lineitem/lineitem.'
CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
BATCH_INTERVAL 2500
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE `lineitem`
FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\'
LINES TERMINATED BY '|\n' STARTING BY '';
In [6]:
%%sql
CREATE PIPELINE IF NOT EXISTS `nation_pipeline`
AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/nation'
CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
BATCH_INTERVAL 2500
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE `nation`
FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\'
LINES TERMINATED BY '|\n' STARTING BY '';
In [7]:
%%sql
CREATE PIPELINE IF NOT EXISTS `orders_pipeline`
AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/orders'
CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
BATCH_INTERVAL 2500
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE `orders`
FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\'
LINES TERMINATED BY '|\n' STARTING BY '';
In [8]:
%%sql
CREATE PIPELINE IF NOT EXISTS `partsupp_pipeline`
AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/partsupp'
CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
BATCH_INTERVAL 2500
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE `partsupp`
FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\'
LINES TERMINATED BY '|\n' STARTING BY '';
In [9]:
%%sql
CREATE PIPELINE IF NOT EXISTS `part_pipeline`
AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/part'
CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
BATCH_INTERVAL 2500
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE `part`
FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\'
LINES TERMINATED BY '|\n' STARTING BY '';
In [10]:
%%sql
CREATE PIPELINE IF NOT EXISTS `region_pipeline`
AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/region'
CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
BATCH_INTERVAL 2500
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE `region`
FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\'
LINES TERMINATED BY '|\n' STARTING BY '';
In [11]:
%%sql
CREATE PIPELINE IF NOT EXISTS `supplier_pipeline`
AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/supplier'
CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
BATCH_INTERVAL 2500
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE `supplier`
FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\'
LINES TERMINATED BY '|\n' STARTING BY '';
In [12]:
%%sql
START PIPELINE customer_pipeline;
START PIPELINE lineitem_pipeline;
START PIPELINE nation_pipeline;
START PIPELINE orders_pipeline;
START PIPELINE partsupp_pipeline;
START PIPELINE part_pipeline;
START PIPELINE region_pipeline;
START PIPELINE supplier_pipeline;
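While the pipelines run, you can also check ingestion progress with plain SQL instead of (or before) the Perspective dashboard below. This is a minimal sketch against the same information_schema view the dashboard queries:
%%sql
SELECT pipeline_name, SUM(rows_streamed) AS rows_streamed
FROM information_schema.pipelines_batches_summary
WHERE database_name = 's2_tpch_unoptimized'
GROUP BY pipeline_name;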
[Optional Step] Check data ingestion in real-time with Perspective
In [13]:
import perspective
import threading
import random
import time
import warnings
from datetime import datetime, date
from perspective import Table, PerspectiveWidget
warnings.filterwarnings('ignore')
In [14]:
def loop():
    while mode != 'stop':
        while mode == 'run':
            table.update(data_source())
            time.sleep(1)
In [15]:
def data_source():
    result = %sql SELECT sum(rows_streamed) AS rows_streamed FROM information_schema.pipelines_batches_summary WHERE database_name = 's2_tpch_unoptimized';
    result2 = list(result.dicts())
    return result2

SCHEMA = {
    "rows_streamed": int
}
In [16]:
mode = 'run'
table = perspective.Table(SCHEMA, limit=100)
threading.Thread(target=loop).start()
In [17]:
perspective.PerspectiveWidget(table, title="Track Row Ingestion", plugin="Y Line", columns=["rows_streamed"])
In [18]:
mode = 'stop'
Now, let's see the performance of a few queries
In [19]:
%%sql
# TPC-H Query 1: Pricing Summary Report
SELECT
    l_returnflag,
    l_linestatus,
    SUM(l_quantity) AS sum_qty,
    SUM(l_extendedprice) AS sum_base_price,
    SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    AVG(l_quantity) AS avg_qty, AVG(l_extendedprice) AS avg_price,
    AVG(l_discount) AS avg_disc,
    COUNT(*) AS count_order
FROM s2_tpch_unoptimized.lineitem
WHERE l_shipdate <= DATE('1998-12-01') - INTERVAL '90' DAY
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
In [20]:
%%sql
# TPC-H Query 4: Order Priority Checking
SELECT
    o_orderpriority,
    COUNT(*) AS order_count
FROM
    s2_tpch_unoptimized.orders
WHERE
    o_orderdate >= DATE('1993-07-01')
    AND o_orderdate < DATE('1993-10-01')
    AND EXISTS (
        SELECT *
        FROM s2_tpch_unoptimized.lineitem
        WHERE l_orderkey = o_orderkey AND l_commitdate < l_receiptdate
    )
GROUP BY o_orderpriority
ORDER BY o_orderpriority;
In [21]:
%%sql
# TPC-H Query 21: Suppliers Who Kept Orders Waiting
SELECT
    s_name,
    COUNT(*) AS numwait
FROM
    s2_tpch_unoptimized.supplier,
    s2_tpch_unoptimized.lineitem l1,
    s2_tpch_unoptimized.orders,
    s2_tpch_unoptimized.nation
WHERE
    s_suppkey = l1.l_suppkey
    AND o_orderkey = l1.l_orderkey
    AND o_orderstatus = 'F'
    AND l1.l_receiptdate > l1.l_commitdate
    AND EXISTS (
        SELECT *
        FROM s2_tpch_unoptimized.lineitem l2
        WHERE
            l2.l_orderkey = l1.l_orderkey
            AND l2.l_suppkey <> l1.l_suppkey
    )
    AND NOT EXISTS (
        SELECT *
        FROM s2_tpch_unoptimized.lineitem l3
        WHERE
            l3.l_orderkey = l1.l_orderkey
            AND l3.l_suppkey <> l1.l_suppkey
            AND l3.l_receiptdate > l3.l_commitdate
    )
    AND s_nationkey = n_nationkey
    AND n_name = 'EGYPT'
GROUP BY
    s_name
ORDER BY
    numwait DESC,
    s_name
LIMIT 100;
Now, let's focus on optimizing the performance
In [22]:
%%sql
CREATE DATABASE IF NOT EXISTS s2_tpch_optimized
Action Required
Make sure to select the s2_tpch_optimized database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
Now, let's create each table with an optimized data structure:
We create a unique key through a primary key. For example, the lineitem table needs both the orderkey and the linenumber to uniquely identify a row.
We create a shard key that distributes the data in a way that makes joins and filtering fast. For the lineitem table, since we join and compute on the orderkey, we shard on orderkey.
In [23]:
%%sql
CREATE TABLE IF NOT EXISTS `lineitem` (
    `l_orderkey` bigint(11) NOT NULL,
    `l_partkey` int(11) NOT NULL,
    `l_suppkey` int(11) NOT NULL,
    `l_linenumber` int(11) NOT NULL,
    `l_quantity` decimal(15,2) NOT NULL,
    `l_extendedprice` decimal(15,2) NOT NULL,
    `l_discount` decimal(15,2) NOT NULL,
    `l_tax` decimal(15,2) NOT NULL,
    `l_returnflag` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `l_linestatus` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `l_shipdate` date NOT NULL,
    `l_commitdate` date NOT NULL,
    `l_receiptdate` date NOT NULL,
    `l_shipinstruct` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `l_shipmode` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `l_comment` varchar(44) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    UNIQUE KEY `pk` (`l_orderkey`,`l_linenumber`) USING HASH,
    SHARD KEY `__SHARDKEY` (`l_orderkey`),
    KEY `l_orderkey` (`l_orderkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE IF NOT EXISTS `customer` (
    `c_custkey` int(11) NOT NULL,
    `c_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `c_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `c_nationkey` int(11) NOT NULL,
    `c_phone` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `c_acctbal` decimal(15,2) NOT NULL,
    `c_mktsegment` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `c_comment` varchar(117) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    UNIQUE KEY `pk` (`c_custkey`) USING HASH,
    SHARD KEY `__SHARDKEY` (`c_custkey`),
    KEY `c_custkey` (`c_custkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE IF NOT EXISTS `nation` (
    `n_nationkey` int(11) NOT NULL,
    `n_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `n_regionkey` int(11) NOT NULL,
    `n_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    UNIQUE KEY `pk` (`n_nationkey`) USING HASH,
    SHARD KEY `__SHARDKEY` (`n_nationkey`),
    KEY `n_nationkey` (`n_nationkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE IF NOT EXISTS `orders` (
    `o_orderkey` bigint(11) NOT NULL,
    `o_custkey` int(11) NOT NULL,
    `o_orderstatus` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `o_totalprice` decimal(15,2) NOT NULL,
    `o_orderdate` date NOT NULL,
    `o_orderpriority` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `o_clerk` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `o_shippriority` int(11) NOT NULL,
    `o_comment` varchar(79) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    UNIQUE KEY `pk` (`o_orderkey`) USING HASH,
    SHARD KEY `__SHARDKEY` (`o_orderkey`),
    KEY `o_orderkey` (`o_orderkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE IF NOT EXISTS `part` (
    `p_partkey` int(11) NOT NULL,
    `p_name` varchar(55) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `p_mfgr` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `p_brand` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `p_type` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `p_size` int(11) NOT NULL,
    `p_container` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `p_retailprice` decimal(15,2) NOT NULL,
    `p_comment` varchar(23) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    UNIQUE KEY `pk` (`p_partkey`) USING HASH,
    SHARD KEY `__SHARDKEY` (`p_partkey`),
    KEY `p_partkey` (`p_partkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE IF NOT EXISTS `partsupp` (
    `ps_partkey` int(11) NOT NULL,
    `ps_suppkey` int(11) NOT NULL,
    `ps_availqty` int(11) NOT NULL,
    `ps_supplycost` decimal(15,2) NOT NULL,
    `ps_comment` varchar(199) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    UNIQUE KEY `pk` (`ps_partkey`,`ps_suppkey`) USING HASH,
    SHARD KEY `__SHARDKEY` (`ps_partkey`),
    KEY `ps_partkey` (`ps_partkey`,`ps_suppkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE IF NOT EXISTS `region` (
    `r_regionkey` int(11) NOT NULL,
    `r_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `r_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    UNIQUE KEY `pk` (`r_regionkey`) USING HASH,
    SHARD KEY `__SHARDKEY` (`r_regionkey`),
    KEY `r_regionkey` (`r_regionkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE IF NOT EXISTS `supplier` (
    `s_suppkey` int(11) NOT NULL,
    `s_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `s_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `s_nationkey` int(11) NOT NULL,
    `s_phone` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    `s_acctbal` decimal(15,2) NOT NULL,
    `s_comment` varchar(101) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    UNIQUE KEY `pk` (`s_suppkey`) USING HASH,
    SHARD KEY `__SHARDKEY` (`s_suppkey`),
    KEY `s_suppkey` (`s_suppkey`) USING CLUSTERED COLUMNSTORE
);
In [24]:
%%sql
INSERT INTO s2_tpch_optimized.nation SELECT * FROM s2_tpch_unoptimized.nation;
INSERT INTO s2_tpch_optimized.lineitem SELECT * FROM s2_tpch_unoptimized.lineitem;
INSERT INTO s2_tpch_optimized.customer SELECT * FROM s2_tpch_unoptimized.customer;
INSERT INTO s2_tpch_optimized.orders SELECT * FROM s2_tpch_unoptimized.orders;
INSERT INTO s2_tpch_optimized.part SELECT * FROM s2_tpch_unoptimized.part;
INSERT INTO s2_tpch_optimized.partsupp SELECT * FROM s2_tpch_unoptimized.partsupp;
INSERT INTO s2_tpch_optimized.region SELECT * FROM s2_tpch_unoptimized.region;
INSERT INTO s2_tpch_optimized.supplier SELECT * FROM s2_tpch_unoptimized.supplier;
In [25]:
%%sql
# TPC-H Query 1: Pricing Summary Report
SELECT
    l_returnflag,
    l_linestatus,
    SUM(l_quantity) AS sum_qty,
    SUM(l_extendedprice) AS sum_base_price,
    SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    AVG(l_quantity) AS avg_qty, AVG(l_extendedprice) AS avg_price,
    AVG(l_discount) AS avg_disc,
    COUNT(*) AS count_order
FROM lineitem
WHERE l_shipdate <= DATE('1998-12-01') - INTERVAL '90' DAY
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
In [26]:
%%sql
# TPC-H Query 4: Order Priority Checking
SELECT
    o_orderpriority,
    COUNT(*) AS order_count
FROM
    s2_tpch_optimized.orders
WHERE
    o_orderdate >= DATE('1993-07-01')
    AND o_orderdate < DATE('1993-10-01')
    AND EXISTS (
        SELECT *
        FROM s2_tpch_optimized.lineitem
        WHERE l_orderkey = o_orderkey AND l_commitdate < l_receiptdate
    )
GROUP BY o_orderpriority
ORDER BY o_orderpriority;
In [27]:
%%sql
# TPC-H Query 21: Suppliers Who Kept Orders Waiting
SELECT
    s_name,
    COUNT(*) AS numwait
FROM
    s2_tpch_optimized.supplier,
    s2_tpch_optimized.lineitem l1,
    s2_tpch_optimized.orders,
    s2_tpch_optimized.nation
WHERE
    s_suppkey = l1.l_suppkey
    AND o_orderkey = l1.l_orderkey
    AND o_orderstatus = 'F'
    AND l1.l_receiptdate > l1.l_commitdate
    AND EXISTS (
        SELECT *
        FROM s2_tpch_optimized.lineitem l2
        WHERE
            l2.l_orderkey = l1.l_orderkey
            AND l2.l_suppkey <> l1.l_suppkey
    )
    AND NOT EXISTS (
        SELECT *
        FROM s2_tpch_optimized.lineitem l3
        WHERE
            l3.l_orderkey = l1.l_orderkey
            AND l3.l_suppkey <> l1.l_suppkey
            AND l3.l_receiptdate > l3.l_commitdate
    )
    AND s_nationkey = n_nationkey
    AND n_name = 'EGYPT'
GROUP BY
    s_name
ORDER BY
    numwait DESC,
    s_name
LIMIT 100;
Finally, let's do a side-by-side comparison between the optimized and unoptimized databases
In [28]:
from singlestoredb import create_engine
import sqlalchemy as sa

db_connection_unoptimized = create_engine(database='s2_tpch_unoptimized').connect()
db_connection_optimized = create_engine(database='s2_tpch_optimized').connect()
Here are a few queries that you can test side by side. Overall, you should notice roughly a 4x improvement in performance.
In [29]:
sql_query4 = sa.text('''
SELECT
    o_orderpriority,
    COUNT(*) AS order_count
FROM
    s2_tpch_unoptimized.orders
WHERE
    o_orderdate >= DATE('1993-07-01')
    AND o_orderdate < DATE('1993-10-01')
    AND EXISTS (
        SELECT *
        FROM s2_tpch_unoptimized.lineitem
        WHERE l_orderkey = o_orderkey AND l_commitdate < l_receiptdate
    )
GROUP BY o_orderpriority
ORDER BY o_orderpriority;
''')
In [30]:
sql_query21 = sa.text('''
SELECT
    s_name,
    COUNT(*) AS numwait
FROM
    supplier,
    lineitem l1,
    orders,
    nation
WHERE
    s_suppkey = l1.l_suppkey
    AND o_orderkey = l1.l_orderkey
    AND o_orderstatus = 'F'
    AND l1.l_receiptdate > l1.l_commitdate
    AND EXISTS (
        SELECT *
        FROM lineitem l2
        WHERE
            l2.l_orderkey = l1.l_orderkey
            AND l2.l_suppkey <> l1.l_suppkey
    )
    AND NOT EXISTS (
        SELECT *
        FROM lineitem l3
        WHERE
            l3.l_orderkey = l1.l_orderkey
            AND l3.l_suppkey <> l1.l_suppkey
            AND l3.l_receiptdate > l3.l_commitdate
    )
    AND s_nationkey = n_nationkey
    AND n_name = 'EGYPT'
GROUP BY
    s_name
ORDER BY
    numwait DESC,
    s_name
LIMIT 100;
''')
In [31]:
result = db_connection_optimized.execute(sql_query21)
In [32]:
import time
import pandas as pd
import plotly.graph_objs as go

num_iterations = 10
opt_times = []

for i in range(num_iterations):
    opt_start_time = time.time()
    opt_result = db_connection_optimized.execute(sql_query21)
    opt_stop_time = time.time()
    opt_times.append(opt_stop_time - opt_start_time)

unopt_times = []
for i in range(num_iterations):
    unopt_start_time = time.time()
    unopt_result = db_connection_unoptimized.execute(sql_query21)
    unopt_stop_time = time.time()
    unopt_times.append(unopt_stop_time - unopt_start_time)

x_axis = list(range(1, num_iterations + 1))
data = {
    'iteration': x_axis,
    'opt_times': opt_times,
    'unopt_times': unopt_times,
}
df = pd.DataFrame.from_dict(data)

fig = go.Figure()

# Adding optimized times to the plot
fig.add_trace(go.Scatter(x=df['iteration'], y=df['opt_times'], mode='lines+markers', name='Optimized Database'))

# Adding unoptimized times to the plot
fig.add_trace(go.Scatter(x=df['iteration'], y=df['unopt_times'], mode='lines+markers', name='Unoptimized Database'))

# Update y-axis and x-axis properties
fig.update_layout(
    title="Execution Time Comparison",
    xaxis_title="Iteration",
    yaxis_title="Time in Seconds",
    xaxis=dict(tickmode='array', tickvals=list(range(1, num_iterations + 1)))
)

# Show the plot
fig.show()

Details
About this Template
This notebook will help you understand how you can take advantage of SingleStoreDB's distributed capabilities using TPC-H 100.
This Notebook can be run in Standard and Enterprise deployments.
License
This Notebook has been released under the Apache 2.0 open source license.