Loading half a billion records in 40 minutes
Blog

Loading half a billion records in 40 minutes

Adam Derewecki wrote a cool post about his experience loading half a billion records into MySQL. SingleStore is a MySQL-compatible drop-in replacement that is built from the ground up to run really fast in memory.

Disclaimer: this will not be an apples-to-apples comparison with derwik, since we obviously don't have the same dataset, and we need a much bigger machine to load everything into memory. But I believe this experiment will get the point across. So without further ado, let's go through the steps.

Schema

Here is the schema for the table. Note that the table has three indexes; on top of those, SingleStore will automatically generate a primary index under the covers. We won't be disabling keys.

drop table if exists store_sales_fact;
create table store_sales_fact(
       date_key                   smallint not null,
       pos_transaction_number     integer not null,
       sales_quantity             smallint not null,
       sales_dollar_amount        smallint not null,
       cost_dollar_amount         smallint not null,
       gross_profit_dollar_amount smallint not null,
       transaction_type           varchar(16) not null,
       transaction_time           time not null,
       tender_type                char(6) not null,
       product_description        varchar(128) not null,
       sku_number                 char(12) not null,
       store_name                 char(10) not null,
       store_number               smallint not null,
       store_city                 varchar(64) not null,
       store_state                char(2) not null,
       store_region               varchar(64) not null,
       key date_key(date_key),
       key store_region(store_region),
       key store_state(store_state)
);

Hardware

We have a very cool machine from Peak Hosting, with 64 cores and 512 GB of RAM. You can rent one for yourself for a little under two grand a month; they were kind enough to give it to us to use for free. Here is the spec of one core.

vendor_id       : AuthenticAMD
cpu family      : 21
model           : 1
model name      : AMD Opteron(TM) Processor 6276
stepping        : 2
cpu MHz         : 2300.254
cache size      : 2048 KB

You read that correctly: this machine has sixty-four 2.3 GHz cores and 512 GB of RAM, almost 8 times the largest memory footprint available in the cloud today, all on dedicated hardware with no virtualization overhead or resource contention with unknown third parties.

Loading efficiently

Loading data efficiently is actually not that trivial. The best way to do it with SingleStore is to use as much CPU as you can get. Here are a few tricks that can be applied.

1. Multi-inserts

We can batch inserts into 100-row multi-inserts. This reduces the number of network roundtrips, since each roundtrip now accounts for 100 rows instead of one. Here is what a multi-insert looks like.

insert into store_sales_fact values
('1','1719','4','280','97','183','purchase','14:09:10','Cash','Brand #6 chicken noodle soup','SKU-#6','Store71','71','Lancaster','CA','West'),
('1','1719','4','280','97','183','purchase','14:09:10','Cash','Brand #5 golf clubs','SKU-#5','Store71','71','Lancaster','CA','West'),
('1','1719','4','280','97','183','purchase','14:09:10','Cash','Brand #4 brandy','SKU-#4','Store71','71','Lancaster','CA','West'),
...
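The post doesn't show how these batches were produced, so here is a minimal sketch of one way to do it, assuming the raw rows live in a CSV file. The file names, the BATCH_SIZE constant, and the quote() helper are hypothetical, and real data would need proper escaping (ideally through a client library).

# Minimal sketch (not from the original post): turn a CSV of fact rows into
# 100-row multi-insert statements written to a .sql file.
import csv

BATCH_SIZE = 100  # rows per INSERT, as described above

def quote(value):
    # naive single-quote escaping, for illustration only
    return "'" + str(value).replace("'", "''") + "'"

def emit_batches(csv_path, out_path):
    with open(csv_path, newline="") as src, open(out_path, "w") as dst:
        batch = []
        for row in csv.reader(src):
            batch.append("(" + ",".join(quote(v) for v in row) + ")")
            if len(batch) == BATCH_SIZE:
                dst.write("insert into store_sales_fact values" + ",".join(batch) + ";\n")
                batch = []
        if batch:  # flush the final partial batch
            dst.write("insert into store_sales_fact values" + ",".join(batch) + ";\n")

if __name__ == "__main__":
    emit_batches("store_sales_fact.csv", "data/data0.sql")

Keeping one complete multi-insert statement per line also makes the file easy to split into chunks later.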
2. Load in parallel

Our customer has a sample file of 510,593,334 records. We could use the command-line mysql client to pipe this file into SingleStore, but that would not leverage all the cores available in the system. So ideally we should split the file into at least as many chunks as there are CPUs in the system.

3. Increase granularity

Splitting the file into 64 big chunks would introduce data skew: the total load time would be the time the slowest thread takes to load its chunk. To address this problem we split the file into thousands of chunks, and every time a thread frees up it starts loading another chunk. So we split the file into 2000 chunks (one way to produce them is sketched at the end of this post).

-rw-r--r--  1  35662844 2012-06-06 11:42 data10.sql
-rw-r--r--  1  35651723 2012-06-06 11:42 data11.sql
-rw-r--r--  1  35658433 2012-06-06 11:42 data12.sql
-rw-r--r--  1  35663665 2012-06-06 11:42 data13.sql
-rw-r--r--  1  35667480 2012-06-06 11:42 data14.sql
-rw-r--r--  1  35659549 2012-06-06 11:42 data15.sql
-rw-r--r--  1  35661617 2012-06-06 11:42 data16.sql
-rw-r--r--  1  35650414 2012-06-06 11:42 data17.sql
-rw-r--r--  1  35661625 2012-06-06 11:42 data18.sql
-rw-r--r--  1  35667634 2012-06-06 11:42 data19.sql
-rw-r--r--  1  35662989 2012-06-06 11:42 data1.sql

4. Load script

The load script uses the Python multiprocessing library to load the data efficiently.

import re
import sys
import os
import multiprocessing
import optparse

parser = optparse.OptionParser()
parser.add_option("-D", "--database", help="database name")
parser.add_option("-P", "--port", help="port to connect. use 3306 to connect to memsql and 3307 to connect to mysql", type="int")
(options, args) = parser.parse_args()

if not options.database or not options.port:
    parser.print_help()
    exit(1)

total_files = 2000

def load_file(filename):
    try:
        print "loading from cpu: %d" % os.getpid()
        # pipe one chunk into the server through the mysql command-line client
        query = 'mysql -h 127.0.0.1 -D %s -u root -P %d < %s' % (options.database, options.port, filename)
        print query
        os.system(query)
        print "done loading from cpu: %d" % os.getpid()
    except Exception as e:
        print e
        pass

# clear out any rows from a previous run before loading
os.system('echo "delete from store_sales_fact" | mysql -h 127.0.0.1 -u root -P %d' % options.port)
# oversubscribe the pool with two workers per core to keep every CPU busy
p = multiprocessing.Pool(processes = 2*multiprocessing.cpu_count())
for j in range(0, total_files):
    p.apply_async(load_file, ['data/data%d.sql' % j])

p.close()
p.join()

5. Running it

I started the load by issuing the following command.

time python load.py -D test -P 3306

After kicking this off, let's start htop to check processor saturation. It looks pretty busy.
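The post doesn't show the splitting step itself, so here is a minimal sketch of how the 2000 chunks could be produced, assuming the source file (called store_sales_fact.sql here) contains one complete multi-insert statement per line. The file names and the two-pass approach are assumptions for illustration, not the author's script.

# Minimal sketch (not from the original post): split a file of complete
# multi-insert statements, one per line, into data/data0.sql .. data/data1999.sql.
import os

TOTAL_FILES = 2000  # matches total_files in the load script above

def split_file(src_path, out_dir="data"):
    os.makedirs(out_dir, exist_ok=True)
    # first pass: count statements so every chunk gets a similar share
    with open(src_path) as src:
        total = sum(1 for _ in src)
    per_chunk = max(1, -(-total // TOTAL_FILES))  # ceiling division
    # second pass: write the chunks sequentially, one output file at a time
    with open(src_path) as src:
        for i in range(TOTAL_FILES):
            # create every chunk, even an empty one, so the load script
            # finds all of data0.sql .. data1999.sql
            with open(os.path.join(out_dir, "data%d.sql" % i), "w") as dst:
                for _ in range(per_chunk):
                    line = src.readline()
                    if not line:
                        break
                    dst.write(line)

if __name__ == "__main__":
    split_file("store_sales_fact.sql")

Because each line is a self-contained insert statement, any chunk can be piped into the server independently, which is what lets the load script above fan the work out across the worker pool.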
Read Now
Anuvu Speed Tests MariaDB, Snowflake & SingleStoreDB
Webinar

Anuvu Speed Tests MariaDB, Snowflake & SingleStoreDB

In this webinar, we’ll discuss Anuvu’s database journey with Senior Infrastructure Engineer Orlando Jimenez. We’ll review the challenges Anuvu faced with their previous architecture including MariaDB, MySQL, Elasticsearch and Snowflake, and the optimizations that led to 20X performance gains with SingleStoreDB.
Watch Now
Supercharge your SaaS Applications
Solution Brief

Supercharge your SaaS Applications

SingleStore powers fast modern applications for over 100 leading SaaS players and Tier-1 enterprises around the globe.
Read Now
Data Warehouse Augmentation with SingleStore
Solution Brief

Data Warehouse Augmentation with SingleStore

SingleStoreDB Self-Managed surpasses the limits of data warehouses to drive up to 20-100x faster time-to-insights, powering low-latency analytics and applications.
Read Now
SingleStoreDB Self-Managed with IBM
Solution Brief

SingleStoreDB Self-Managed with IBM

IBM provides a one-stop experience for the procurement, use, and management of SingleStoreDB Self-Managed. In addition, users will gain access to IBM’s global deployment capabilities and expert support, which includes client consultancy on individual database technologies and overall solution architectures.
Read Now