JSON Analytics Enhancements with SingleStore 8.5

At the launch of our latest product release, SingleStore announced new improvements across the board. One key focus from SingleStore has been around JSON analytics, leading to the general availability of SingleStore Kai™. In our latest release, we've significantly enhanced our analytical capabilities for JSON data. This improvement focuses on delivering more efficient, faster aggregations over JSON arrays — marking a substantial advancement in how we process JSON data.

Why JSON data is important

JSON data is becoming increasingly important in modern data management due to its ability to store non-tabular structures — like arrays and key-value pairs — in complex nested levels. This makes JSON columns ideal for data that doesn't fit neatly into traditional relational databases, a common scenario in diverse data landscapes.

With JSON columns, you can store different data shapes in the same column providing flexibility to rapidly evolving applications. Storing data in JSON format also simplifies integration with external systems, APIs and web applications.

JSON in SingleStore

With its multi-model capabilities, SingleStore offers the best of both SQL and NoSQL, providing flexibility by supporting JSON data types, alongside JSON-specific operators and built-in functions that enable working with JSON using SQL queries.

Optimizing JSON data type is part of an ongoing effort at SingleStore to enable our customers to query semi-structured data effectively, and support their applications that deal with a variety of data formats. SingleStore has introduced features for fast row lookups and operations on JSON data. The JSON built-in functions and the unique SingleStore-encoded Parquet storage format are designed to minimize data that is read from a JSON column for executing queries. These enhancements are aimed at matching the performance of JSON columns to any top-level columns.

Vectorized JSON array aggregation

With the 8.5 release we have added support for vectorized arrays aggregations, further boosting the aggregation capabilities over arrays in JSON data. Vectorized execution, a method where data operations are conducted on batches rather than individual rows, is at the core of this improvement. This approach applies particularly to GROUP BY operations, where the aggregation computation is executed in batches — considerably accelerating the process.

By leveraging CPU vector instructions, like SIMD (Single Instruction, Multiple Data), this method enables quick and efficient processing of multiple data points within a single CPU instruction. Vectorization for aggregations over JSON arrays also avoids the cost of fully materializing rows prior to processing the JOIN operator. This allows the database engine to use the same JOIN and GROUP BY execution code path used for any normal join between two columnstore tables — minimizing the data that is read from the JSON column for executing queries.

Performance gains

To quantify the performance enhancements achieved with our new optimization, I conducted a test using the TPC-H dataset, available among SingleStore's sample datasets.The TPC-H dataset primarily consists of relational tables with standard columns.

For this test I have taken two tables: order and line items. The schema of the orders and line items tables can be reviewed here. To introduce a JSON array component, I aggregated all line items corresponding to each order into a JSON array. This array was then incorporated as a JSON column lineitems_json in the orders table. The dataset contains a total of 15 million orders. In the newly added JSON column, there are 60 million line items now represented as array elements.

A sample line item object in the lineitems_json column looks like this:

[

{

    "l_comment": "deposits around the quickly regular packa",

    "l_commitdate": "1996-05-06",

    "l_discount": 0,

    "l_extendedprice": 51239.88,

    "l_linenumber": 1,

    "l_linestatus": "O",

    "l_partkey": 754369,

    "l_quantity": 36,

    "l_receiptdate": "1996-07-07",

    "l_returnflag": "N",

    "l_shipdate": "1996-06-25",

    "l_shipinstruct": "COLLECT COD",

    "l_shipmode": "AIR",

    "l_suppkey": 54370,

    "l_tax": 0.01

}

]

The following are some aggregation queries applied over arrays in the JSON column.

Query 1

This query calculates the total number of items for each shipping mode. It involves aggregating (sum) and grouping (GROUP BY) on fields within the JSON array.

1
SELECT t.table_col::$l_shipmode, sum(t.table_col::%l_quantity) as quantity
2
FROM orders
3
JOIN TABLE(JSON_TO_ARRAY(lineitems_json)) t
4
GROUP BY 1

Query 2

This query counts the total number of line items in each order priority category. It groups by on a regular column of the 'orders' table, and aggregates from the JSON array.

1
SELECT o_orderpriority as priority, sum(t.table_col::$l_quantity) as quantity,
2
FROM orders
3
JOIN TABLE(JSON_TO_ARRAY(lineitems_json)) t
4
GROUP BY 1

Query 3

The third query is a slightly more complex aggregation involving multiple fields.

1
SELECT
2
   t.table_col::$l_returnflag as r,
3
   t.table_col::$l_linestatus as s,
4
   sum(t.table_col::%l_quantity) as sum_qty,
5
   sum(t.table_col::%l_extendedprice) as sum_base_price,
6
   sum(t.table_col::%l_extendedprice * (1 - t.table_col::%l_discount)) as 
7
sum_disc_price,
8
   avg(t.table_col::%l_quantity) as avg_qty
9
FROM orders
10
JOIN TABLE(JSON_TO_ARRAY(lineitems_json)) t
11
GROUP by r, s

The results were measured on a S-2 workspace.

Query	Before optimization	After optimization	Improvement
Query 1 and Query 2	23 seconds	1.3 seconds	18x faster
Query 3	27 seconds	2.1 seconds	13x faster

SingleStore has added a significant performance boost that enables faster, more responsive analytics over JSON data that contains arrays. As the data volume increases — and the size and number of array fields grow — you can expect to observe even greater gains.

We are continuing the optimizations to further improve different query shapes that involve filters on JSON array fields while aggregating and working with multi-level nested arrays.

In summary, the ability to store data in JSON columns, particularly those involving arrays and subdocuments, offers a convenient and efficient method for managing complex data structures and this innovation opens up new possibilities for advanced analytics on JSON data. With Kai you would be able to use MongoAPI and commands along with SQL to query your data!

Interested in trying out our JSON capabilities? Get started with SingleStore today.