Learn how to Optimize Performance with TPCH 100

Note

This tutorial is meant for Standard and Premium workspaces. You cannot run it on a Free Starter Workspace due to storage restrictions. Create a workspace using +group in the left nav and select Standard for this notebook. Gallery notebooks tagged with "Starter" are suitable to run on a Free Starter Workspace.

Context

This notebook covers four key principles for getting performance out of SingleStoreDB, illustrated with the TPC-H benchmark. SingleStoreDB is a distributed database, so you should think about database partitions, shard keys, primary keys, and indexes to get the best performance out of it.

About database partitions

The generalized recommendation for most clusters is to have 4 CPU cores per database partition on each leaf. This means that if you had a cluster with 16 cores on each of 4 leaves (64 CPU cores total across all leaf hosts), you would want 4 partitions on each leaf (16 partitions across the cluster). If you are using an S00 workspace, you will have 2 partitions per database. Note that increasing the number of partitions adds memory and caching overhead, which can be expensive if you have thousands of tables.
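For example, applying the 4-cores-per-partition rule to a cluster with 4 leaves of 16 cores each gives 64 / 4 = 16 partitions. A minimal sketch of pinning and verifying the partition count (the database name is hypothetical; the same information_schema view is used later in this notebook):

-- Hypothetical example: set the partition count at creation time (it cannot be changed afterwards)
CREATE DATABASE my_tpch_db PARTITIONS=16;

-- Verify the partition count
SELECT database_name, num_partitions
FROM information_schema.DISTRIBUTED_DATABASES
WHERE database_name = 'my_tpch_db';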

About shard keys

Data is distributed across the SingleStoreDB Cloud workspace into a number of partitions on the leaf nodes. The shard key is a collection of columns in a table that controls how the rows of that table are distributed. To determine the partition responsible for a given row, SingleStoreDB Cloud computes a hash of all the columns in the shard key and maps it to a partition ID. Rows with the same shard key therefore reside on the same partition.
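As a minimal sketch (a simplified table, not the exact schema used later in this notebook), the SHARD KEY clause below distributes rows by o_orderkey, so all rows for a given order key land on the same partition and joins or filters on that key can be resolved within a single partition:

-- Sketch: rows with the same o_orderkey always hash to the same partition
CREATE TABLE orders_sharded (
  o_orderkey BIGINT NOT NULL,
  o_custkey INT NOT NULL,
  o_totalprice DECIMAL(15,2) NOT NULL,
  SHARD KEY (o_orderkey)
);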

About hash indexes

Hash indexes are highly efficient for exact-match lookups (point reads). Because a hash index stores rows in a sparse array of buckets indexed through a hash function on the relevant columns, a query can retrieve data by examining only the corresponding bucket rather than searching the entire dataset. This significantly reduces lookup time and therefore improves performance for these query types.
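As an illustration (a simplified sketch, not part of the TPC-H schema created below), a hash index on the lookup column lets an exact-match filter probe a single bucket:

-- Sketch: hash index to accelerate point-reads on c_custkey
CREATE TABLE customer_lookup (
  c_custkey INT NOT NULL,
  c_name VARCHAR(25) NOT NULL,
  KEY (c_custkey) USING HASH
);

-- Exact-match lookup served by the hash index
SELECT c_name FROM customer_lookup WHERE c_custkey = 12345;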


Note

For this tutorial, we recommend using a workspace of size S4 to ingest data faster and to see the performance gain you can get from a distributed architecture.

Let's first create the unoptimized database

In [1]:

%%sql
CREATE DATABASE IF NOT EXISTS s2_tpch_unoptimized

# To create a database with custom partitions, use the following syntax: CREATE DATABASE YourDatabaseName PARTITIONS=X;
# The number of partitions cannot be changed after creation

If you are using an S00 workspace, the database will have 2 partitions; with an S1, it will have 8 partitions.

In [2]:

%%sql
SELECT num_partitions FROM information_schema.DISTRIBUTED_DATABASES WHERE database_name = 's2_tpch_unoptimized';

Let's create all the tables in that database with no index, shard key or primary key

Action Required

Make sure to select the s2_tpch_unoptimized database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

In [3]:

%%sql
CREATE TABLE IF NOT EXISTS `customer` (
  `c_custkey` int(11) NOT NULL,
  `c_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `c_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `c_nationkey` int(11) NOT NULL,
  `c_phone` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `c_acctbal` decimal(15,2) NOT NULL,
  `c_mktsegment` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `c_comment` varchar(117) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
);

CREATE TABLE IF NOT EXISTS `lineitem` (
  `l_orderkey` bigint(11) NOT NULL,
  `l_partkey` int(11) NOT NULL,
  `l_suppkey` int(11) NOT NULL,
  `l_linenumber` int(11) NOT NULL,
  `l_quantity` decimal(15,2) NOT NULL,
  `l_extendedprice` decimal(15,2) NOT NULL,
  `l_discount` decimal(15,2) NOT NULL,
  `l_tax` decimal(15,2) NOT NULL,
  `l_returnflag` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `l_linestatus` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `l_shipdate` date NOT NULL,
  `l_commitdate` date NOT NULL,
  `l_receiptdate` date NOT NULL,
  `l_shipinstruct` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `l_shipmode` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `l_comment` varchar(44) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
);

CREATE TABLE IF NOT EXISTS `nation` (
  `n_nationkey` int(11) NOT NULL,
  `n_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `n_regionkey` int(11) NOT NULL,
  `n_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
);

CREATE TABLE IF NOT EXISTS `orders` (
  `o_orderkey` bigint(11) NOT NULL,
  `o_custkey` int(11) NOT NULL,
  `o_orderstatus` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `o_totalprice` decimal(15,2) NOT NULL,
  `o_orderdate` date NOT NULL,
  `o_orderpriority` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `o_clerk` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `o_shippriority` int(11) NOT NULL,
  `o_comment` varchar(79) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
);

CREATE TABLE IF NOT EXISTS `part` (
  `p_partkey` int(11) NOT NULL,
  `p_name` varchar(55) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `p_mfgr` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `p_brand` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `p_type` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `p_size` int(11) NOT NULL,
  `p_container` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `p_retailprice` decimal(15,2) NOT NULL,
  `p_comment` varchar(23) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
);

CREATE TABLE IF NOT EXISTS `partsupp` (
  `ps_partkey` int(11) NOT NULL,
  `ps_suppkey` int(11) NOT NULL,
  `ps_availqty` int(11) NOT NULL,
  `ps_supplycost` decimal(15,2) NOT NULL,
  `ps_comment` varchar(199) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
);

CREATE TABLE IF NOT EXISTS `region` (
  `r_regionkey` int(11) NOT NULL,
  `r_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `r_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
);

CREATE TABLE IF NOT EXISTS `supplier` (
  `s_suppkey` int(11) NOT NULL,
  `s_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `s_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `s_nationkey` int(11) NOT NULL,
  `s_phone` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `s_acctbal` decimal(15,2) NOT NULL,
  `s_comment` varchar(101) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
);

Now let's create the pipelines and run them to ingest data

In [4]:

%%sql
CREATE PIPELINE IF NOT EXISTS `customer_pipeline`
    AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/customer'
    CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
    BATCH_INTERVAL 2500
    DISABLE OUT_OF_ORDER OPTIMIZATION
    DISABLE OFFSETS METADATA GC
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE `customer`
    FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\'
    LINES TERMINATED BY '|\n' STARTING BY '';

In [5]:

%%sql
CREATE PIPELINE IF NOT EXISTS `lineitem_pipeline`
    AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/lineitem/lineitem.'
    CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
    BATCH_INTERVAL 2500
    DISABLE OUT_OF_ORDER OPTIMIZATION
    DISABLE OFFSETS METADATA GC
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE `lineitem`
    FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\'
    LINES TERMINATED BY '|\n' STARTING BY '';

In [6]:

%%sql
CREATE PIPELINE IF NOT EXISTS `nation_pipeline`
    AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/nation'
    CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
    BATCH_INTERVAL 2500
    DISABLE OUT_OF_ORDER OPTIMIZATION
    DISABLE OFFSETS METADATA GC
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE `nation`
    FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\'
    LINES TERMINATED BY '|\n' STARTING BY '';

In [7]:

%%sql
CREATE PIPELINE IF NOT EXISTS `orders_pipeline`
    AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/orders'
    CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
    BATCH_INTERVAL 2500
    DISABLE OUT_OF_ORDER OPTIMIZATION
    DISABLE OFFSETS METADATA GC
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE `orders`
    FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\'
    LINES TERMINATED BY '|\n' STARTING BY '';

In [8]:

%%sql
CREATE PIPELINE IF NOT EXISTS `partsupp_pipeline`
    AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/partsupp'
    CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
    BATCH_INTERVAL 2500
    DISABLE OUT_OF_ORDER OPTIMIZATION
    DISABLE OFFSETS METADATA GC
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE `partsupp`
    FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\'
    LINES TERMINATED BY '|\n' STARTING BY '';

In [9]:

%%sql
CREATE PIPELINE IF NOT EXISTS `part_pipeline`
    AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/part'
    CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
    BATCH_INTERVAL 2500
    DISABLE OUT_OF_ORDER OPTIMIZATION
    DISABLE OFFSETS METADATA GC
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE `part`
    FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\'
    LINES TERMINATED BY '|\n' STARTING BY '';

In [10]:

%%sql
CREATE PIPELINE IF NOT EXISTS `region_pipeline`
    AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/region'
    CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
    BATCH_INTERVAL 2500
    DISABLE OUT_OF_ORDER OPTIMIZATION
    DISABLE OFFSETS METADATA GC
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE `region`
    FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\'
    LINES TERMINATED BY '|\n' STARTING BY '';

In [11]:

%%sql
CREATE PIPELINE IF NOT EXISTS `supplier_pipeline`
    AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/supplier'
    CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
    BATCH_INTERVAL 2500
    DISABLE OUT_OF_ORDER OPTIMIZATION
    DISABLE OFFSETS METADATA GC
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE `supplier`
    FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\'
    LINES TERMINATED BY '|\n' STARTING BY '';

In [12]:

%%sql
START PIPELINE customer_pipeline;
START PIPELINE lineitem_pipeline;
START PIPELINE nation_pipeline;
START PIPELINE orders_pipeline;
START PIPELINE partsupp_pipeline;
START PIPELINE part_pipeline;
START PIPELINE region_pipeline;
START PIPELINE supplier_pipeline;
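Ingesting the SF100 data set takes some time. If you want a quick progress check without the Perspective widget below, a query along these lines against the same information_schema view used in the next cells shows how many rows each pipeline has streamed so far (a sketch; adjust the selected columns as needed):

-- Sketch: monitor ingestion progress per pipeline
SELECT pipeline_name, SUM(rows_streamed) AS rows_streamed
FROM information_schema.pipelines_batches_summary
WHERE database_name = 's2_tpch_unoptimized'
GROUP BY pipeline_name;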

[Optional Step] Check data ingestion in real-time with Perspective

In [13]:

import perspective
import threading
import random
import time
import warnings
from datetime import datetime, date
from perspective import Table, PerspectiveWidget

warnings.filterwarnings('ignore')

In [14]:

def loop():
    while mode != 'stop':
        while mode == 'run':
            table.update(data_source())
            time.sleep(1)

In [15]:

def data_source():
    result = %sql SELECT sum(rows_streamed) AS rows_streamed FROM information_schema.pipelines_batches_summary WHERE database_name = 's2_tpch_unoptimized';
    result2 = list(result.dicts())
    return result2

SCHEMA = {
    "rows_streamed": int
}

In [16]:

mode = 'run'
table = perspective.Table(SCHEMA, limit=100)
threading.Thread(target=loop).start()

In [17]:

perspective.PerspectiveWidget(table, title="Track Row Ingestion", plugin="Y Line", columns=["rows_streamed"])

In [18]:

mode = 'stop'

Now, let's see the performance of a few queries

In [19]:

%%sql
# TPC-H Query 1: Pricing Summary Report
SELECT
    l_returnflag,
    l_linestatus,
    SUM(l_quantity) AS sum_qty,
    SUM(l_extendedprice) AS sum_base_price,
    SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    AVG(l_quantity) AS avg_qty, AVG(l_extendedprice) AS avg_price,
    AVG(l_discount) AS avg_disc,
    COUNT(*) AS count_order
FROM s2_tpch_unoptimized.lineitem
WHERE l_shipdate <= DATE('1998-12-01') - INTERVAL '90' DAY
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;

In [20]:

%%sql
# TPC-H Query 4: Order Priority Checking
SELECT
    o_orderpriority,
    COUNT(*) AS order_count
FROM
    s2_tpch_unoptimized.orders
WHERE
    o_orderdate >= DATE('1993-07-01')
    AND o_orderdate < DATE('1993-10-01')
    AND EXISTS (
        SELECT *
        FROM s2_tpch_unoptimized.lineitem
        WHERE l_orderkey = o_orderkey AND l_commitdate < l_receiptdate
    )
GROUP BY o_orderpriority
ORDER BY o_orderpriority;

In [21]:

%%sql
# TPC-H Query 21: Suppliers Who Kept Orders Waiting
SELECT
    s_name,
    COUNT(*) AS numwait
FROM
    s2_tpch_unoptimized.supplier,
    s2_tpch_unoptimized.lineitem l1,
    s2_tpch_unoptimized.orders,
    s2_tpch_unoptimized.nation
WHERE
    s_suppkey = l1.l_suppkey
    AND o_orderkey = l1.l_orderkey
    AND o_orderstatus = 'F'
    AND l1.l_receiptdate > l1.l_commitdate
    AND EXISTS (
        SELECT *
        FROM s2_tpch_unoptimized.lineitem l2
        WHERE
            l2.l_orderkey = l1.l_orderkey
            AND l2.l_suppkey <> l1.l_suppkey
    )
    AND NOT EXISTS (
        SELECT *
        FROM s2_tpch_unoptimized.lineitem l3
        WHERE
            l3.l_orderkey = l1.l_orderkey
            AND l3.l_suppkey <> l1.l_suppkey
            AND l3.l_receiptdate > l3.l_commitdate
    )
    AND s_nationkey = n_nationkey
    AND n_name = 'EGYPT'
GROUP BY
    s_name
ORDER BY
    numwait DESC,
    s_name
LIMIT 100;

Now, let's focus on optimizing the performance

In [22]:

%%sql
CREATE DATABASE IF NOT EXISTS s2_tpch_optimized

Action Required

Make sure to select the s2_tpch_optimized database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

Now, let's create each table with an optimized structure:
  • We define a unique key that serves as the primary key. For example, the lineitem table needs both l_orderkey and l_linenumber to uniquely identify a row.

  • We define a shard key that distributes data so that joins and filters can be performed efficiently. For the lineitem table, since we join and aggregate on the order key, we shard on l_orderkey (an EXPLAIN sketch after the data load below shows how to verify this).

In [23]:

%%sql
CREATE TABLE IF NOT EXISTS `lineitem` (
  `l_orderkey` bigint(11) NOT NULL,
  `l_partkey` int(11) NOT NULL,
  `l_suppkey` int(11) NOT NULL,
  `l_linenumber` int(11) NOT NULL,
  `l_quantity` decimal(15,2) NOT NULL,
  `l_extendedprice` decimal(15,2) NOT NULL,
  `l_discount` decimal(15,2) NOT NULL,
  `l_tax` decimal(15,2) NOT NULL,
  `l_returnflag` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `l_linestatus` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `l_shipdate` date NOT NULL,
  `l_commitdate` date NOT NULL,
  `l_receiptdate` date NOT NULL,
  `l_shipinstruct` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `l_shipmode` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `l_comment` varchar(44) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  UNIQUE KEY `pk` (`l_orderkey`,`l_linenumber`) USING HASH,
  SHARD KEY `__SHARDKEY` (`l_orderkey`),
  KEY `l_orderkey` (`l_orderkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE IF NOT EXISTS `customer` (
  `c_custkey` int(11) NOT NULL,
  `c_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `c_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `c_nationkey` int(11) NOT NULL,
  `c_phone` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `c_acctbal` decimal(15,2) NOT NULL,
  `c_mktsegment` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `c_comment` varchar(117) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  UNIQUE KEY `pk` (`c_custkey`) USING HASH,
  SHARD KEY `__SHARDKEY` (`c_custkey`),
  KEY `c_custkey` (`c_custkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE IF NOT EXISTS `nation` (
  `n_nationkey` int(11) NOT NULL,
  `n_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `n_regionkey` int(11) NOT NULL,
  `n_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  UNIQUE KEY `pk` (`n_nationkey`) USING HASH,
  SHARD KEY `__SHARDKEY` (`n_nationkey`),
  KEY `n_nationkey` (`n_nationkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE IF NOT EXISTS `orders` (
  `o_orderkey` bigint(11) NOT NULL,
  `o_custkey` int(11) NOT NULL,
  `o_orderstatus` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `o_totalprice` decimal(15,2) NOT NULL,
  `o_orderdate` date NOT NULL,
  `o_orderpriority` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `o_clerk` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `o_shippriority` int(11) NOT NULL,
  `o_comment` varchar(79) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  UNIQUE KEY `pk` (`o_orderkey`) USING HASH,
  SHARD KEY `__SHARDKEY` (`o_orderkey`),
  KEY `o_orderkey` (`o_orderkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE IF NOT EXISTS `part` (
  `p_partkey` int(11) NOT NULL,
  `p_name` varchar(55) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `p_mfgr` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `p_brand` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `p_type` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `p_size` int(11) NOT NULL,
  `p_container` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `p_retailprice` decimal(15,2) NOT NULL,
  `p_comment` varchar(23) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  UNIQUE KEY `pk` (`p_partkey`) USING HASH,
  SHARD KEY `__SHARDKEY` (`p_partkey`),
  KEY `p_partkey` (`p_partkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE IF NOT EXISTS `partsupp` (
  `ps_partkey` int(11) NOT NULL,
  `ps_suppkey` int(11) NOT NULL,
  `ps_availqty` int(11) NOT NULL,
  `ps_supplycost` decimal(15,2) NOT NULL,
  `ps_comment` varchar(199) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  UNIQUE KEY `pk` (`ps_partkey`,`ps_suppkey`) USING HASH,
  SHARD KEY `__SHARDKEY` (`ps_partkey`),
  KEY `ps_partkey` (`ps_partkey`,`ps_suppkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE IF NOT EXISTS `region` (
  `r_regionkey` int(11) NOT NULL,
  `r_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `r_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  UNIQUE KEY `pk` (`r_regionkey`) USING HASH,
  SHARD KEY `__SHARDKEY` (`r_regionkey`),
  KEY `r_regionkey` (`r_regionkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE IF NOT EXISTS `supplier` (
  `s_suppkey` int(11) NOT NULL,
  `s_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `s_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `s_nationkey` int(11) NOT NULL,
  `s_phone` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `s_acctbal` decimal(15,2) NOT NULL,
  `s_comment` varchar(101) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  UNIQUE KEY `pk` (`s_suppkey`) USING HASH,
  SHARD KEY `__SHARDKEY` (`s_suppkey`),
  KEY `s_suppkey` (`s_suppkey`) USING CLUSTERED COLUMNSTORE
);

In [24]:

%%sql
INSERT INTO s2_tpch_optimized.nation SELECT * FROM s2_tpch_unoptimized.nation;
INSERT INTO s2_tpch_optimized.lineitem SELECT * FROM s2_tpch_unoptimized.lineitem;
INSERT INTO s2_tpch_optimized.customer SELECT * FROM s2_tpch_unoptimized.customer;
INSERT INTO s2_tpch_optimized.orders SELECT * FROM s2_tpch_unoptimized.orders;
INSERT INTO s2_tpch_optimized.part SELECT * FROM s2_tpch_unoptimized.part;
INSERT INTO s2_tpch_optimized.partsupp SELECT * FROM s2_tpch_unoptimized.partsupp;
INSERT INTO s2_tpch_optimized.region SELECT * FROM s2_tpch_unoptimized.region;
INSERT INTO s2_tpch_optimized.supplier SELECT * FROM s2_tpch_unoptimized.supplier;
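Optionally, before re-running the queries, you can check that the shard keys behave as intended. A minimal sketch, assuming it is run against s2_tpch_optimized: EXPLAIN prints the distributed query plan, and with both orders and lineitem sharded on the order key, the join below should not need a repartition or broadcast step.

-- Sketch: inspect the plan for a join on the shard key (l_orderkey = o_orderkey)
EXPLAIN
SELECT o_orderpriority, COUNT(*) AS order_count
FROM orders
JOIN lineitem ON l_orderkey = o_orderkey
GROUP BY o_orderpriority;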

In [25]:

%%sql
# TPC-H Query 1: Pricing Summary Report
SELECT
    l_returnflag,
    l_linestatus,
    SUM(l_quantity) AS sum_qty,
    SUM(l_extendedprice) AS sum_base_price,
    SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    AVG(l_quantity) AS avg_qty, AVG(l_extendedprice) AS avg_price,
    AVG(l_discount) AS avg_disc,
    COUNT(*) AS count_order
FROM lineitem
WHERE l_shipdate <= DATE('1998-12-01') - INTERVAL '90' DAY
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;

In [26]:

%%sql
# TPC-H Query 4: Order Priority Checking
SELECT
    o_orderpriority,
    COUNT(*) AS order_count
FROM
    s2_tpch_optimized.orders
WHERE
    o_orderdate >= DATE('1993-07-01')
    AND o_orderdate < DATE('1993-10-01')
    AND EXISTS (
        SELECT *
        FROM s2_tpch_optimized.lineitem
        WHERE l_orderkey = o_orderkey AND l_commitdate < l_receiptdate
    )
GROUP BY o_orderpriority
ORDER BY o_orderpriority;

In [27]:

%%sql
# TPC-H Query 21: Suppliers Who Kept Orders Waiting
SELECT
    s_name,
    COUNT(*) AS numwait
FROM
    s2_tpch_optimized.supplier,
    s2_tpch_optimized.lineitem l1,
    s2_tpch_optimized.orders,
    s2_tpch_optimized.nation
WHERE
    s_suppkey = l1.l_suppkey
    AND o_orderkey = l1.l_orderkey
    AND o_orderstatus = 'F'
    AND l1.l_receiptdate > l1.l_commitdate
    AND EXISTS (
        SELECT *
        FROM s2_tpch_optimized.lineitem l2
        WHERE
            l2.l_orderkey = l1.l_orderkey
            AND l2.l_suppkey <> l1.l_suppkey
    )
    AND NOT EXISTS (
        SELECT *
        FROM s2_tpch_optimized.lineitem l3
        WHERE
            l3.l_orderkey = l1.l_orderkey
            AND l3.l_suppkey <> l1.l_suppkey
            AND l3.l_receiptdate > l3.l_commitdate
    )
    AND s_nationkey = n_nationkey
    AND n_name = 'EGYPT'
GROUP BY
    s_name
ORDER BY
    numwait DESC,
    s_name
LIMIT 100;

Finally, let's do a side-by-side comparison between the optimized and unoptimized databases

In [28]:

from singlestoredb import create_engine
import sqlalchemy as sa

db_connection_unoptimized = create_engine(database='s2_tpch_unoptimized').connect()
db_connection_optimized = create_engine(database='s2_tpch_optimized').connect()

Here are a few queries that you can test side by side. Overall, you will notice an average 4x improvement in performance.

In [29]:

sql_query4 = sa.text('''
SELECT
    o_orderpriority,
    COUNT(*) AS order_count
FROM
    s2_tpch_unoptimized.orders
WHERE
    o_orderdate >= DATE('1993-07-01')
    AND o_orderdate < DATE('1993-10-01')
    AND EXISTS (
        SELECT *
        FROM s2_tpch_unoptimized.lineitem
        WHERE l_orderkey = o_orderkey AND l_commitdate < l_receiptdate
    )
GROUP BY o_orderpriority
ORDER BY o_orderpriority;
''')

In [30]:

sql_query21 = sa.text('''
SELECT
    s_name,
    COUNT(*) AS numwait
FROM
    supplier,
    lineitem l1,
    orders,
    nation
WHERE
    s_suppkey = l1.l_suppkey
    AND o_orderkey = l1.l_orderkey
    AND o_orderstatus = 'F'
    AND l1.l_receiptdate > l1.l_commitdate
    AND EXISTS (
        SELECT *
        FROM lineitem l2
        WHERE
            l2.l_orderkey = l1.l_orderkey
            AND l2.l_suppkey <> l1.l_suppkey
    )
    AND NOT EXISTS (
        SELECT *
        FROM lineitem l3
        WHERE
            l3.l_orderkey = l1.l_orderkey
            AND l3.l_suppkey <> l1.l_suppkey
            AND l3.l_receiptdate > l3.l_commitdate
    )
    AND s_nationkey = n_nationkey
    AND n_name = 'EGYPT'
GROUP BY
    s_name
ORDER BY
    numwait DESC,
    s_name
LIMIT 100;
''')

In [31]:

result = db_connection_optimized.execute(sql_query21)

In [32]:

import time
import pandas as pd
import plotly.graph_objs as go

num_iterations = 10
opt_times = []

for i in range(num_iterations):
    opt_start_time = time.time()
    opt_result = db_connection_optimized.execute(sql_query21)
    opt_stop_time = time.time()
    opt_times.append(opt_stop_time - opt_start_time)

unopt_times = []
for i in range(num_iterations):
    unopt_start_time = time.time()
    unopt_result = db_connection_unoptimized.execute(sql_query21)
    unopt_stop_time = time.time()
    unopt_times.append(unopt_stop_time - unopt_start_time)

x_axis = list(range(1, num_iterations + 1))
data = {
    'iteration': x_axis,
    'opt_times': opt_times,
    'unopt_times': unopt_times,
}
df = pd.DataFrame.from_dict(data)

fig = go.Figure()

# Adding optimized times to the plot
fig.add_trace(go.Scatter(x=df['iteration'], y=df['opt_times'], mode='lines+markers', name='Optimized Database'))

# Adding unoptimized times to the plot
fig.add_trace(go.Scatter(x=df['iteration'], y=df['unopt_times'], mode='lines+markers', name='Unoptimized Database'))

# Update y-axis and x-axis properties
fig.update_layout(
    title="Execution Time Comparison",
    xaxis_title="Iteration",
    yaxis_title="Time in Seconds",
    xaxis=dict(tickmode='array', tickvals=list(range(1, num_iterations + 1)))
)

# Show the plot
fig.show()



About this Template

This notebook will help you understand how to take advantage of SingleStoreDB's distributed capabilities using the TPC-H 100 benchmark.

This Notebook can be run in Standard and Enterprise deployments.

Tags

advanced, performance tuning, shard key, ingest, tpch

License

This Notebook has been released under the Apache 2.0 open source license.
