Performance Improvements in SingleStoreDB Self-Managed 6.5


Eric Hanson

Director of Product Management


With SingleStoreDB Self-Managed 6.5, the fastest database just got a lot faster. SingleStore is a performance-oriented product that capitalizes on several techniques to give excellent overall speed. These techniques include optimized disk-based and in-memory data structures, parallelism, scale out, compilation of queries to machine code, vectorization, and single instruction, multiple data (SIMD).

In SingleStoreDB Self-Managed 6.5, we advanced performance in many dimensions, including query execution (QE), query optimization (QO), data loading, and system management. SingleStoreDB Self-Managed 6.0 was already fast. Very fast. But we were not satisfied. QE performance is always a high priority for us since it’s what users see most when exploring data. We improved QE performance for the following operations:

  1. Shuffle (data redistribution)
  2. High-cardinality GROUP BY
  3. IN-list filtering
  4. Filtering of data stored with integer run-length encoding (RLE)
  5. Internal memory allocation

For many users, the improved shuffle and high-cardinality GROUP BY performance will be the most noticeable advancements. In an internal test workload that we run based on the TPC-H benchmark, the average query speed improved by 2.2X, over a set of 22 queries. The IN-list filter, integer RLE filtering, and memory allocation improvements can all give 10X or more query speedups in the right situation.

The query optimizer and statistics systems have improved significantly. The blazing-fast SIMD QE we used to demonstrate a trillion-row-per-second query with our 6.0 release had gotten so fast that our query optimizer didn’t always know when to use it. So we re-calibrated the optimizer to use it when it’s best suited. Also, we’ve improved our statistics (histograms) to have much higher resolution, which will lead to better query plans.

Finally, loading and system management have improved. Loading from Amazon S3 cloud storage using our Pipelines feature can start in seconds in SingleStoreDB Self-Managed 6.5, versus minutes in SingleStoreDB Self-Managed 6.0, for large S3 buckets. Load speed for wide comma-separated value (CSV) files with quoted strings is more than twice as fast. Columnstore loading and all other operations that require large sorts are far faster. And finally, restart recovery for clusters with hundreds of databases is more than ten times as fast.

Query Performance

Here, we’ll examine all the query performance improvements in SingleStoreDB Self-Managed 6.5 in more depth.

Fast Shuffle

SingleStore is a distributed, shared-nothing database. Applications connect to aggregator nodes. Aggregators compile queries, start execution, and assemble results. The data is stored across multiple leaf nodes using hash partitioning (sharding). The execution of queries requires data to be moved between leaf nodes, as well as from leaves to aggregators.

Data movement takes time, and we don’t like it when queries take too much time. That led us to create Project Sanic: When Ya Gotta Go Fast! (Google images of “Sanic” and you’ll see our project mascot.) Sanic is all about speeding up data movement in SingleStore.

Before SingleStoreDB Self-Managed 6.5, shuffle heavily relied on converting data to text form, then back to an internal binary form for processing at the receiving node. 6.5 shuffles data primarily in binary format. For certain types of data, especially numbers and dates, this saves a lot of CPU time on both the sending and receiving ends, because we don’t have to take time to convert to string form and back again.

Consider a simple example. Suppose we create this table:

create table t(a int, b int, shard(a), key(a));

Then we insert about 164 million rows into it, with unique integer values for every row for both columns a and b.
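The original post doesn’t show how this data was generated. One hypothetical way to synthesize that many unique integers is to cross-join a small helper table of digits (a sketch only; the digits table and exact row count are assumptions, not part of the original example):

-- Hypothetical generator: an 8-way cross join of a 10-row digits table yields
-- 100,000,000 unique integers (0 through 99,999,999); add another join or widen
-- the seed table to reach roughly 164 million rows. Note that t is an in-memory
-- rowstore, so the cluster needs enough RAM to hold the result.
create table digits(d int);
insert into digits values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

insert into t
select n, n
from (
  select d1.d
       + d2.d * 10
       + d3.d * 100
       + d4.d * 1000
       + d5.d * 10000
       + d6.d * 100000
       + d7.d * 1000000
       + d8.d * 10000000 as n
  from digits d1, digits d2, digits d3, digits d4,
       digits d5, digits d6, digits d7, digits d8
) numbers;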

This table is sharded (hash partitioned) by column a across leaf nodes. Each leaf node has an index on column a. Now, suppose we want to run this query:

select count(distinct b) from t;

A highly parallelizable version of this query repartitions data by column b using hashing, then performs a local distincting operation on each leaf. Finally, these results are concatenated together at the aggregator, and the result is returned to the client.
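To see how the engine plans this, you can EXPLAIN the query; exact operator names vary by version, but the plan should show a repartition of the data by b, a local distinct/aggregate on the leaves, and a final gather and aggregation on the aggregator:

explain select count(distinct b) from t;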

The query above runs on a 4-leaf system with 16 cores and 16 partitions total in the following times:

| SingleStoreDB Self-Managed 6.0 | SingleStoreDB Self-Managed 6.5 | Speedup (times) |
| --- | --- | --- |
| 26.57 sec | 8.74 sec | 3.04 |

This healthy 3.04 times speedup is due completely to shuffle performance improvements.

High-Cardinality GROUP BY

High-cardinality GROUP BY queries are ones that have many distinct grouping keys (say, tens of thousands or more). In 6.0, these queries computed a local grouping for each partition, and each local result was forwarded to the aggregator, which did the final grouping. If there were many partitions per node, this could involve a lot of data transmission, causing a network bottleneck and leaving the aggregator with a lot of work to do to form the final aggregate set.

In 6.5, performance of queries like this is improved by (a) doing a local aggregate at the leaf level of all the partitions on that leaf, (b) spilling the rows to the network if it turns out almost every GROUP BY key is distinct so the local aggregation is not helping, and (c) reducing memory usage by having each thread handle a subset of the keys.

The realistic best-case query for this improvement is a GROUP BY that shuffles (i.e., does not match the shard key), where each partition has a large set of keys but the sets of keys across partitions are roughly the same. Shard-key-matching GROUP BY is not affected by this change.

The key part of this improvement is the local aggregate at the leaf level, since it allows more of the work to be done by the leaves and less by the aggregators, and it reduces network traffic.
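As a hypothetical illustration (not from the original post), suppose a table clicks is sharded on user_id and we group on url, which has hundreds of thousands of distinct values and is not the shard key. The grouping keys must be repartitioned by url, and in 6.5 each leaf first aggregates across all of its local partitions before sending rows over the network:

select url, count(*) as hits
from clicks
group by url
order by hits desc
limit 100;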

Performance Improvement:
This improvement can give several-times speedups for some queries. For example, query q22 in one of our internal test workloads derived from TPC-H speeded up by more than a factor of 2 based solely on this change.

IN-list Filter Improvement

Performance has been improved for queries that use IN-list filters. In 6.0, every row required a full scan of the IN-list. In 6.5, a Bloom filter is used to quickly check whether a row can possibly match any element of the list, which increases scan speed.

For IN-list filters that disqualify the large majority of rows, this can give order-of-magnitude speedups. This improvement works for row store and column store tables. It is always enabled.

Here’s an example that shows the dramatic speedup possible for IN-list queries in 6.5. For a table t(i int) with 8 million rows of unique integers, in row store format, we ran the following queries using 6.0 and 6.5:

| Query | 6.5 | 6.0 |
| --- | --- | --- |
| `select count(*) from t where i in (1, 100, 200);` | 0.23 sec | same |
| `select count(*) from t where i in (1, 2, ... 904);` | 0.20 sec (60x speedup!) | 12.0 sec |

The second query, with an IN-list containing 904 elements, speeds up tremendously, while the first one, with a short IN-list, stays the same.
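If you want to set up a similar test, the sketch below shows one possible setup; the original post doesn’t give the DDL or the data-generation step, so treat these details as assumptions:

-- an in-memory rowstore table holding 8 million unique integers
create table t(i int);
-- populate t with 8,000,000 unique values (for example, with a digits-table
-- cross join like the one sketched earlier), then time the short IN-list query
-- and the long 904-element IN-list query from the table above.
select count(*) from t where i in (1, 100, 200);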

Integer RLE Filter Pushdown

Run-length encoding (RLE) is one of several compression strategies used for our columnstore tables. SingleStoreDB Self-Managed 6.5 can now push integer filters down to RLE-encoded data in the columnstore scan layer, giving significant speedup.

Here’s a performance example. The table t(i int) is a columnstore with 10 million rows and 1,000 unique values, keyed on i, so it will be encoded with RLE.
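The exact DDL isn’t shown in the original example, but a table like that could be declared roughly as follows, using the clustered columnstore key so that sorted runs of i compress with RLE (a sketch, not the original schema):

create table t (
  i int,
  key(i) using clustered columnstore
);

With the table populated as described, we time this query: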

select count(*) from t where i = 2;

The performance results on two cores with two partitions are as follows:

| 6.0 | 6.5 |
| --- | --- |
| 0.11 sec | 0.002 sec |

As you can see, this enhancement can provide order-of-magnitude improvements. It’s always on; there is no need to set any switches to control it.

Internal Memory Allocation Speedup

Query execution memory allocation performance was significantly improved. Some TPC-DS benchmark queries sped up by over 2X based on this change, and this microbenchmark sped up by a factor of 20 compared to SingleStoreDB Self-Managed 6.0:

select count(substr(str, 0, 9)) from t;

The improvement is in the internal memory allocator used by queries that contain non-trivial string expressions or create temporary rows. So queries such as this can become far faster.
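Other queries with the same flavor should see similar benefits. The examples below are purely hypothetical (the table and column names are made up), but they show the pattern of building many temporary strings or temporary rows per input row:

select count(distinct concat(first_name, ' ', last_name)) from people;
select substr(str, 1, 9) as prefix, count(*) from t group by prefix;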

Query Optimizer and Statistics

QE improvements are fantastic, but the QO system is always critical to make sure the right QE operators are used in the best possible way. Our query optimizer is the guidance system of the SingleStore DBMS.

Automatic Selection of HashGroupBy to Get Operations on Encoded Data

As we publicized in our Trillion Rows Per Second demo, in SingleStoreDB Self-Managed 6.0 we improved our query execution dramatically for single-table GROUP BY/aggregate queries, by using a new implementation of HashGroupBy that uses SIMD and operations on encoded data. In 6.5, we’ve improved the costing code in the query optimizer by calibrating it so it understands more accurately the relative speed of HashGroupBy and StreamingGroupBy operations. So queries like this now routinely get a HashGroupBy instead of StreamingGroupBy:

select stock_symbol, count(*) as c
from trade
group by stock_symbol
order by c desc
limit 10;

The table trade is a columnstore, sharded on id or stock_symbol and with key on stock_symbol.
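A trade table matching that description might be declared as follows; the extra columns are illustrative assumptions rather than the exact schema used in the demo:

create table trade (
  id bigint,
  stock_symbol char(5),
  shares int,
  share_price decimal(18,4),
  trade_time datetime,
  shard key (stock_symbol),
  key (stock_symbol) using clustered columnstore
);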

In effect, our query optimizer has caught up with the new capabilities available in our query execution system. The result is that queries may now run up to 80 times faster, automatically, with no need for hints or other workarounds.

Advanced Histograms for Improved Filter Cardinality Estimation

SingleStoreDB Self-Managed 6.5 introduces a new type of histogram to represent what we sometimes refer to as “range” statistics. These histograms are termed “advanced histograms” and the existing histograms will be called “legacy histograms.” The advanced histograms hold substantially more information than the legacy histograms for certain data distributions. So, non-uniform data distributions with skew typically have much better estimates for less frequent values, and also better average estimates, than in SingleStoreDB Self-Managed 6.0. For example, for this data distribution:

+------+----------+
| v    | count(*) |
+------+----------+
|    1 |   320000 |
|    2 |      320 |
|    3 |      352 |
|    4 |      256 |
|    5 |        1 |
|    6 |      512 |
|    7 |      512 |
|    8 |      544 |
|    9 |      512 |
|   10 |      608 |
|   11 |        3 |
|   12 |      608 |
|   13 |      576 |
|   14 |      288 |
|   15 |      288 |
|   16 |      288 |
|   17 |      160 |
|   18 |      352 |
|   19 |        2 |
|   20 |      480 |
+------+----------+

And this query:

select count(*) from t where v = <constant>;

We get these estimates:

| Value of `<constant>` | actual | 6.0 estimated | 6.5 estimated |
| --- | --- | --- | --- |
| 1 | 320,000 | 320,047 | 320,174 |
| 3 | 352 | 450 | 366 |
| 5 | 1 | 450 | 1 |

In 6.0, the high-frequency values (such as 1) crowded out information about the lower-frequency values (such as 5) in the histogram, causing loss of resolution. In 6.5, this crowding-out effect happens much less quickly. Normally up to 50 data values can be represented nearly exactly, regardless of data frequency skew. In a larger query, errors such as this can compound and cause a sub-optimal strategy to be chosen for a join (e.g. hash instead of nested loop), or a distributed operation (e.g. shuffle instead of broadcast). Since these errors are reduced in 6.5, overall query performance will improve.
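To benefit from the higher-resolution estimates, histograms need to exist on the filtered column. The sketch below shows one way to collect them and then inspect the plan the optimizer chooses; the ANALYZE form shown follows the documented syntax for enabling column histograms, but check the documentation for your version before relying on it:

analyze table t columns v enable;            -- collect histograms (range statistics) on t.v
explain select count(*) from t where v = 5;  -- inspect the plan chosen with the new estimates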

Before-and-After Results, 6.0 to 6.5, TPC-H

Query performance in one of our internal test workloads derived from the TPC-H benchmark improved dramatically from 6.0 to 6.5, about 2.2X on average. The primary reason is improved shuffle performance, with the high-cardinality GROUP BY improvement contributing to a lesser extent. Also, query q19 improved due to a rewrite that extracts common conditions from OR: (a and b) or (a and c) → a and (b or c).

SingleStore TPC-H Scale Factor 1000, May 2018, Hardware: 3 Amazon Web Services (AWS) r3.8xlarge leaves, 32 vcores each
| Query | 6.0 Time (sec) | 6.5 Time (sec) | Speedup (%) | Speedup (times) | Notes |
| --- | --- | --- | --- | --- | --- |
| q1 | 5.824 | 5.38 | 7.62% | 1.1 | |
| q2 | 3.537 | 2.977 | 15.83% | 1.2 | |
| q3 | 45.661 | 18.052 | 60.47% | 2.5 | |
| q4 | 6.437 | 6.47 | -0.51% | 1.0 | |
| q5 | 22.127 | 12.669 | 42.74% | 1.7 | |
| q6 | 3.334 | 3.392 | -1.74% | 1.0 | |
| q7 | 17.478 | 9.641 | 44.84% | 1.8 | |
| q8 | 4.97 | 4.023 | 19.05% | 1.2 | |
| q9 | 78.55 | 33.866 | 56.89% | 2.3 | |
| q10 | 15.395 | 11.287 | 26.68% | 1.4 | |
| q11 | 3.654 | 3.495 | 4.35% | 1.0 | |
| q12 | 3.821 | 4.252 | -11.28% | 0.9 | |
| q13 | 62.558 | 39.001 | 37.66% | 1.6 | |
| q14 | 75.244 | 21.528 | 71.39% | 3.5 | |
| q15 | 24.668 | 9.931 | 59.74% | 2.5 | |
| q16 | 40.461 | 12.24 | 69.75% | 3.3 | |
| q17 | 7.21 | 3.21 | 55.48% | 2.2 | |
| q18 | 300† | 72.387 | 75.87% | 4.1 | query timed out with error on 6.0 |
| q19 | 6.691 | 4.998 | 25.30% | 1.3 | |
| q20 | 70.276 | 26 | 63.00% | 2.7 | |
| q21 | 149.832 | 143.061 | 4.52% | 1.0 | |
| q22 | 46.297 | 9.69 | 79.07% | 4.8 | |
| Averages | 33.0 (GEOMEAN) | 20.8 (GEOMEAN) | 37% (AVERAGE) | 2.0 (AVERAGE) | |

†q18 did not run in under 300 sec on 6.0 so its time is given as the timeout of 300 sec.

Load Performance

Before you can experience the benefits of improved query speed, you have to get data into your database. Load performance has historically been a strength of SingleStore, and we’ve made it even better in 6.5.

Speedup of CSV File Parsing

LOAD DATA commands have been made substantially faster. For wide tables with long comma-separated value (CSV) input lines that use strings enclosed by quotes or other characters, speedups of more than 2X are possible.

We micro-optimized parsing. The most important change only applies to LOAD DATA with a FIELDS [OPTIONALLY] ENCLOSED BY clause, and is always “on.” It’s expected to be most beneficial for tables with many columns and relatively long lines. It only affects work done on the aggregator during the LOAD, and it doesn’t apply to pipelines.

Loading 6GB of data – with an average CSV line length of 1KB – into a 170-column table, on an AWS cluster with an m4.10xlarge aggregator plus 4 m4.4xlarge leaves and 32 partitions total, we saw:

MySQL [pm]> LOAD DATA INFILE '/home/admin/portable/LOAN_KEY.csv' INTO TABLE LOAN_KEY FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';
Query OK, 5029152 rows affected (31.14 sec)

Before the change, the same load took about twice as long:

MySQL [pm]> LOAD DATA INFILE '/home/admin/portable/LOAN_KEY.csv' INTO TABLE LOAN_KEY FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';
Query OK, 5029152 rows affected (1 min 9.76 sec)

Pipelines from S3 Speedup

The time for the CREATE PIPELINE statement to return, when used on Amazon S3 cloud storage, has been reduced dramatically for S3 prefixes with very large numbers of files under them. For example, for a million files under a prefix, it previously took about 5 minutes, and now it typically takes under 2 seconds.

In addition, under an S3 path with a large number of files, when a new file is added, it will be loaded and visible in the SingleStore database much more quickly than before, typically in seconds rather than minutes.

Previously, the time taken to create an S3 pipeline grew with the number of objects in the source S3 bucket (more precisely, the number of objects matching the specified prefix) and was unbounded. For example, CREATE PIPELINE took about 8 minutes on a bucket with 1.6 million objects. The root cause was the long time it can take to list all the files under an S3 bucket. In 6.5 we fixed this issue, and the number of network round trips taken equals (number of SingleStore partitions) / 1,000, rounded up. Usually this translates to no more than 2 seconds.

The second use case that is improved in 6.5 follows the “continuous loading” pattern, where new files are added to the bucket and we expect to be able to query the data in SingleStore as soon as possible after the files are uploaded. Previously, this latency grew linearly with the number of existing files in the bucket. For example, suppose there are 200,000 files and a new file is uploaded: the rows from that file would not start to be loaded for about 20 seconds. If there are 2 million files, the latency would be over 3 minutes. Since S3 is often used for historical data, it is not uncommon for buckets to grow to the point where real-time analytics becomes infeasible. (Note that overall bandwidth is not affected, so one-time loads from S3 using SingleStore pipelines prior to 6.5 were not problematic.) In 6.5, this latency is not affected by the number of files in the bucket, provided that newly added keys sort after the existing ones.

Best practices: To take best advantage of this improvement, we recommend adding keys to your S3 bucket in ascending sort order. A random order of names in a large set of files can still cause a long delay before the contents of a new file are visible in the database. For example, if you name files with a string of the form “prefix_name_<timestamp>”, where <timestamp> is of the form YYYY-MM-DD-HH:MM:SS.FFF, then the file names will naturally be in sort order, and you’ll benefit from this new improvement.
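For reference, here is a sketch of an S3 pipeline that follows this naming convention; the bucket name, credentials, and target table are placeholders to adapt to your environment:

create pipeline clicks_pipeline as
load data s3 'my-bucket/prefix_name_'
config '{"region": "us-east-1"}'
credentials '{"aws_access_key_id": "<your_access_key_id>", "aws_secret_access_key": "<your_secret_access_key>"}'
into table clicks
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n';

start pipeline clicks_pipeline;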

Columnstore Compression Performance

Columnstore compression performance has been improved, particularly for dictionary-compressed column segments with under 2^16 values.

Overall, LOAD and heavy INSERT workloads for low- to medium-cardinality strings are about 2X faster.

This enhancement improves the speed of the background merger and flusher too.

Sort Speed Performance

The speed of the internal sort routine was improved. This is related to the improvements in the internal memory allocator that were previously mentioned. This improves performance of the background merger used by the columnstore, and any other internal system processes that sort data.

Conclusion

SingleStore has fundamental architectural advantages for performance compared with legacy SQL database systems, including scale out, compilation, in-memory structures, vectorization, and use of SIMD. In the 6.5 release, we’ve built on these architectural pillars to deliver tremendous performance gains. Some queries have speeded up over 60X, and some entire workloads have more than doubled in speed, with no application changes required.

SingleStore is a database on the move. We have a talented engineering team that is keeping up a tremendous pace of improvement from release to release. Given our fundamental advantages, run-anywhere flexibility, and great performance that is only getting better, SingleStore is a great train to get aboard with your demanding data management applications.

