Accessing JSON Data at Warp Speed in SingleStoreDB

AC

Akmal Chaudhri

Senior Technical Evangelist

Accessing JSON Data at Warp Speed in SingleStoreDB

The Winter 2022 Release of SingleStoreDB added performance improvements for transactional (OLTP) and analytical (OLAP) workloads — particularly, for working with JSON and string data.

This article will present several examples to demonstrate the speedup, so you can experience the performance improvements first hand.

introductionIntroduction

It’s no secret JSON is a wildly popular data format, used by developers and engineers all over. We’ve previously covered how SingleStoreDB supports JSON — with the latest innovation in 8.0 for seekable JSON. This enables JSON data to be stored in a compressed format, and when specific rows need to be retrieved, only the retrieved data needs to be decompressed. Figure 1 shows the previous approach for accessing JSON data (left-hand side) and the new approach (right-hand side). The new approach makes JSON suitable for both OLTP and OLAP applications, further cementing SingleStoreDB as the ideal, unified database.

Figure 1. Previous vs now

For developers working extensively with JSON, that means there are no changes in data format or JSON Functions, as enhancements are under the covers. What developers will notice, however, are the performance gains. For management, the improved performance means SingleStoreDB becomes a more compelling business solution, removing the need to integrate multiple products or databases, leading to better Total Cost of Ownership (TCO).

seekable-json-in-actionSeekable JSON in Action

On SingleStore's YouTube Channel, there is a great 5-minute video demonstrating “Faster Seeks for JSON Data on Universal Storage” — and the demo results are impressive. Interested in trying it for yourself? Fortunately, all the details are provided in the SingleStore documentation, and you can find the demo code in the JSON Performance section.

One way we can test the code locally is to install SingleStoreDB in a virtual machine environment (you can find the details for how to do that here). You can also give this a try using the SingleStoreDB Dev Image, which contains all the required software.

install-single-store-dbInstall SingleStoreDB

Let’s install SingleStoreDB in a virtual machine environment, which takes just a few minutes. For this article, we'll only need two tarball files:

Assuming a two-node cluster was correctly deployed and using the same variable names from the virtual machine set up, we can connect to our cluster from a MySQL CLI Client as follows:

mysql --local-infile -u root -h ubuntu -P 3306
--default-auth=mysql_native_password -p

Once connected to our cluster, we need to execute the JSON Performance demo code. The demo code produces over 8 million rows of data.
Results

Executing the code:

CALL get_avg_time(100);


three times with:

SET GLOBAL use_seekable_json = OFF;


produced the following results:

+------------+
| avg_time   |
+------------+
| 0.46785009 |
+------------+

+------------+
| avg_time   |
+------------+
| 0.49248448 |
+------------+

+------------+
| avg_time   |
+------------+
| 0.45003322 |
+------------+


Executing the code:

CALL get_avg_time(100);


three times with:

SET GLOBAL use_seekable_json = ON;


produced the following results:

+------------+
| avg_time   |
+------------+
| 0.00581854 |
+------------+

+------------+
| avg_time   |
+------------+
| 0.00237616 |
+------------+

+------------+
| avg_time   |
+------------+
| 0.00180840 |
+------------+


The results are summarized in Figure 2 using a Log Scale.

Figure 2. SingleStore JSON data example.

Taking the average, these results show we achieved over 140x performance improvement with seekable JSON with this small cluster set-up and environment.

bonus-using-los-angeles-police-department-crime-dataBonus: Using Los Angeles Police Department Crime Data

How does seekable JSON perform on real-world data? Let’s put it to the test with crime data provided by the Los Angeles Police Department — focusing on crime data from 2020 to present. The data can be downloaded in a CSV format, and is available under a Creative Commons CC0 1.0 Universal license. From the website:

“This dataset reflects incidents of crime in the City of Los Angeles dating back to 2020. This data is transcribed from original crime reports that are typed on paper and therefore there may be some inaccuracies within the data. Some location fields with missing data are noted as (0°, 0°). Address fields are only provided to the nearest hundred block in order to maintain privacy. This data is as accurate as the data in the database.”

At the time of writing this article, the downloaded CSV file was approximately 163 MB.

convert-csv-to-jsonConvert CSV to JSON

First, we need to convert the CSV file to JSON. With some help from ChatGPT, the following Python code achieves this, creating correct JSON records using double quotes, nulls and one JSON record per line:

import pandas as pd
import json
import ndjson

input_file  = "rows.csv"
output_file = "rows.json"

df = pd.read_csv(input_file)
df.rename(columns = lambda x: x.replace(" ", "_"), inplace = True)

json_data = json.loads(df.to_json(orient = "records"))

with open(output_file, "w") as f:
    ndjson.dump(json_data, f)

The following shows an example JSON record with the keys displayed and the values removed.

{
  "DR_NO":
  "Date_Rptd":
  "DATE_OCC":
  "TIME_OCC":
  "AREA":
  "AREA_NAME":
  "Rpt_Dist_No":
  "Part_1-2":
  "Crm_Cd":
  "Crm_Cd_Desc":
  "Mocodes":
  "Vict_Age":
  "Vict_Sex":
  "Vict_Descent":
  "Premis_Cd":
  "Premis_Desc":
  "Weapon_Used_Cd":
  "Weapon_Desc":
  "Status":
  "Status_Desc":
  "Crm_Cd_1":
  "Crm_Cd_2":
  "Crm_Cd_3":
  "Crm_Cd_4":
  "LOCATION":
  "Cross_Street":
  "LAT":
  "LON":
}

The JSON file was approximately 432 MB in size.

resultsResults

You can find the complete code listed in the Appendix, which follows the structure of the earlier example. In the code in the Appendix, when loading the data into the database, we would replace /path/to/ with the actual path to where the JSON file was located. The JSON data load creates nearly 640,000 rows.

Executing the code:

CALL get_avg_time(100);


three times with:

SET GLOBAL use_seekable_json = OFF;


produced the following results:

+------------+
| avg_time   |
+------------+
| 0.17954817 |
+------------+

+------------+
| avg_time   |
+------------+
| 0.19996782 |
+------------+

+------------+
| avg_time   |
+------------+
| 0.17837315 |
+------------+

Executing the code:

CALL get_avg_time(100);


three times with:

SET GLOBAL use_seekable_json = ON;


produced the following results:

+------------+
| avg_time   |
+------------+
| 0.00427090 |
+------------+

+------------+
| avg_time   |
+------------+
| 0.00249579 |
+------------+

+------------+
| avg_time   |
+------------+
| 0.00249940 |
+------------+

The results are summarized in Figure 3 using a Log Scale.

Figure 3. Los Angeles Police JSON crime data example.

Taking the average, these results show we achieved over 60x performance improvement with seekable JSON with this small cluster set-up, small dataset and environment.

summarySummary

Our simple performance tests demonstrate that seekable JSON can provide significant speedup. With mixed workloads consisting of both OLTP and OLAP, SingleStoreDB now supports fast access to JSON data in multiple environments, making it the product of choice for many applications.

appendixAppendix

SET GLOBAL use_seekable_json = OFF;

CREATE DATABASE crimes;

USE crimes;

CREATE TABLE crimes (
  id INT UNSIGNED PRIMARY KEY NOT NULL AUTO_INCREMENT,
  crime_doc JSON NOT NULL,
  SORT KEY(id)
);

LOAD DATA LOCAL INFILE '/path/to/rows.json'
INTO TABLE crimes
FORMAT JSON (
  crime_doc <- %
);

DELIMITER //
CREATE OR REPLACE PROCEDURE get_avg_time(n INT) AS
DECLARE
  _id INT;
  _od json;
  _oid INT;
  m INT;
  st DATETIME(6);
  et DATETIME(6);
BEGIN
  SELECT MAX(id) INTO m FROM crimes;
  st = now(6);
  FOR i in 1..n LOOP
    _oid = CEILING(m*RAND());

   SELECT id, crime_doc
   INTO _id, _od
   FROM crimes
   WHERE id = _oid;
  END LOOP;
  et = now(6);

  echo SELECT (timestampdiff(MICROSECOND, st, et)/1000000.0)/n
AS avg_time;
END
//
DELIMITER ;

OPTIMIZE TABLE crimes FULL;

CALL get_avg_time(100);

SET GLOBAL use_seekable_json = ON;

CREATE TABLE crimes2 LIKE crimes;

INSERT INTO crimes2 SELECT * FROM crimes;

OPTIMIZE TABLE crimes2 FULL;

DELIMITER //
CREATE OR REPLACE PROCEDURE get_avg_time(n INT) AS
DECLARE
  _id INT;
  _od json;
  _oid INT;
  m INT;
  st DATETIME(6);
  et DATETIME(6);
BEGIN
  SELECT MAX(id) INTO m FROM crimes2;
  st = now(6);
  FOR i in 1..n LOOP
    _oid = CEILING(m*RAND());

    SELECT id, crime_doc
    INTO _id, _od
    FROM crimes2
    WHERE id = _oid;
 END LOOP;
 et = now(6);
  echo SELECT (timestampdiff(MICROSECOND, st, et)/1000000.0)/n
AS avg_time;
END
//
DELIMITER ;

CALL get_avg_time(100);

Share