JSON Builtins Over Columnstore

Due to its ubiquity and versatility, JavaScript Object Notation (JSON) is a critical data format for efficient database support.

SingleStore has made concerted efforts toward this end, including the addition of Seekable JSON¹ in our 8.0 release which delivered significant performance improvements in OLAP and OLTP workloads using JSON. SingleStore has continued to improve its JSON support by adding new builtin functions, which leverage the underlying storage format used in its columnstore tables. The following blog discusses the technical specifics of how these builtins were implemented and why the design enables efficient analytical queries over SingleStoreDB’s JSON column type.

JSON Under the Hood

Internally, JSON data is stored using SingleStore-encoded Parquet². It works by inferring a schema from the key structure of the inserted JSON data and subsequently striping the data into homogenous columns by keypath. The inferred schema forms a tree structure with each column corresponding to a leaf in that schema tree.

The nesting structure of each JSON document is maintained in the same manner described in the Dremel paper³ — via definition and repetition levels in conjunction with the aforementioned schema tree. The specifics of how this information is used to reassemble JSON documents is described in the Dremel paper.

Important here is that storing JSON data in this columnar Parquet format has several advantages. High compression is achieved through a combination of keeping columns of homogeneous type, avoiding repeatedly storing JSON keys and minimizing the range of values for definition and repetition levels to allow bit-packing. And by striping JSON documents into separate columns by keypath, it becomes possible to only read from disk the portion of the document that is relevant to the query.

These aspects of SingleStore-encoded Parquet are leveraged into efficient builtin functions, providing powerful ways to interact with JSON data.

Example

To visualize how SingleStore’s builtin functions are optimized over columnstore JSON, the following example is provided. Assume this JSON is inserted into a JSON column j_col.

{
	"id": 1, "time": "2022-11-17 23:03:54",
      "lineitems": [
	{
		"line_num": 1,
		"SKU": 688968,
		"description": "Apple iPhone 14 Plus",
		"qty" : 1
	},
        {
		"line_num": 2,
		"SKU": 6514052,
		"description": "Energizer - MAX AA Batteries (8 Pack)",
		"qty" : 3
	},
        {
"line_num" : 3,
                    "SKU" : 6457697,
                    "description": "Star Wars - Galactic Snackin Grogu",
                    "qty" : 1
		}
	]
}

The resulting inferred schema would likely be:

With this schema, six leaf columns are constructed: id, time, line_num, SKU, description and qty.

JSON_EXTRACT

The JSON_EXTRACT⁴ functions are the most commonly used builtins for JSON in SingleStoreDB, and return values from a JSON document at a specified keypath. You will probably notice that this fits perfectly with SingleStore’s internal JSON storage since all of the data relevant to a JSON_EXTRACT query will exist in some subtree (and ideally a single leaf) of the inferred schema. Thus, only a subset of leaf columns must be read — which is much faster than reading the entire JSON document.

Internally, when a JSON_EXTRACT function is called, SingleStore creates a “fake JSON column” which acts like any other table column. The type of this column is determined by the specific JSON_EXTRACT variant (e.g. JSON_EXTRACT_BIGINT()). Information is then passed down to columnstore, which instructs the readers to only read the Parquet leaf columns corresponding to the keypath used in the JSON_EXTRACT call.

In an ideal case, the keypath corresponds to a single leaf column. For the provided example, JSON_EXTRACT_STRING(j_col, ‘time’) would be such a query. The data from the time column can be directly read from the disk into the “fake JSON column” as if it were a top-level table column.

Even if the keypath is not a leaf in the schema, there is still opportunity for optimization.This is because the structure of SingleStore-encoded Parquet enables efficient reconstruction of sub-documents. If JSON_EXTRACT_JSON(j_col, ‘lineitems’) were queried on the example schema, only the line_num, SKU, description and qty columns would need to be read. The JSON array would be reconstructed from these four leaf columns — using the same algorithm for full JSON reconstructions — and subsequently written into the “fake JSON column''.

JSON_MATCH_ANY

One of the unique properties of JSON is the ability to represent one-to-many relationships with JSON arrays. However, one of the limitations of the JSON_EXTRACT builtins is the inability to extract from within a JSON array without directly indexing the array. There are several ways in which one might want to interact with JSON arrays — one of which is to filter for a property within an array.

JSON_MATCH_ANY()⁵ fulfills this use case in a manner that is also optimized over SingleStore-encoded Parquet. Similarly to JSON_EXTRACT, this optimization also populates a “Fake JSON Column '' with the output of JSON_MATCH_ANY for each selected row.

One thing that differentiates JSON_MATCH_ANY from JSON_EXTRACT is the one-to-many relationship introduced by operating over JSON arrays. The way SingleStore-encoded Parquet handles JSON arrays lends itself very nicely to the functionality of JSON_MATCH_ANY, as the contents of an array are striped into a single repeated schema key. Thus if JSON_MATCH_ANY(j_col::?lineitems.qty, MATCH_PARAM_BIGINT_STRICT() 〉; 1) were queried on the example JSON, the only column that would need to be read is the qty column.

There are also finer-grained optimizations for JSON_MATCH_ANY. One example is the JSON_MATCH_ANY_EXISTS variant of the builtin, which checks for the existence of a keypath. Because the value at the keypath is irrelevant, there is no need to read and decode the values. The only thing that needs to be read is the definition level to determine how fully defined the keypath is. Another small optimization is the ability to directly seek to the next selected row when the JSON_MATCH_ANY predicate is satisfied (rather than iterating the remaining values for the current row). This builds off of the 8.0 Seekable JSON improvements.

JSON_INCLUDE_MASK

Another useful operation over JSON is the ability to construct a sub-JSON which contains a subset of the keys. This is achievable with the JSON_INCLUDE_MASK() function introduced in 8.1.

For example, one might query JSON_INCLUDE_MASK(j_col, ‘{“lineitems”:{“SKU”: 1, “qty”: 1}}’) to only retrieve the SKU and qty from the lineitems array. On the example JSON, the result of this query would be:

1
{
2
    "lineitems": [
3
        {"SKU": 688968, "qty" : 1},
4
        {"SKU": 6514052,"qty" : 3},
5
	{"SKU": 6457697,"qty" : 1}
6
    ]
7
}

As you might guess by now, JSON_INCLUDE_MASK is also made efficient by only reading a subset of the leaf columns. From these columns, the design of SingleStore-encoded Parquet allows for the reconstruction of these sub-JSONs using the same algorithm as is used for JSON_EXTRACT_JSON. This is accomplished by constructing a subschema from the original Parquet schema which only contains the keys from the mask.

Interestingly, JSON_INCLUDE_MASK also works over arrays. This means some queries using the JSON_TO_ARRAY() function can also be improved by reducing the amount of data in the array using JSON_INCLUDE_MASK.

Performance Stats

We inserted 10 million records with the above schema into a json column j_col of columnstore table t. The following queries were run on this data — the results are shown in the table here.

1
-- Query 1
2
SELECT MAX(j_col::%id) FROM t;
3
-- Query 2
4
SELECT j_col::%id
5
	FROM t
6
	WHERE JSON_MATCH_ANY(
7
		j_col::?lineitems.SKU,
8
		MATCH_PARAM_BIGINT_STRICT() = 650000);

	Average Latency (ms)		x Improvement
	Without Improvement	With Improvement
Query 1	10154.32	26.17	x388
Query 2	10408.02	395.20	x26

Both queries demonstrate significant reduction in query latency. From these numbers there are some clues as to how the SingleStore-encoded Parquet enables such improvement. Without the columnstore improvement both queries take about 10 seconds, because both queries have to read and reconstruct the entire JSON document. However, with the columnstore improvement we see latency drop by several orders of magnitude.

Query 1 completely ignores the lineitems array and directly writes integers to the “fake JSON column”, rather than reconstructing a document with keys.

Likewise, Query 2 only has to read the SKU leaf column. It is slower than Query 1 because it has to read about 10x the amount of data Query 1 reads (average array length is 10).

On this particular dataset, the advantages of JSON_INCLUDE_MASK are not particularly obvious; however when dealing with wide JSONs with many keys, JSON_INCLUDE_MASK shines. On a dataset with 50 keys in each record of the lineitems array and a query like the following, there was about 3.5x improvement — and the main bottleneck becomes the join on TABLE(JSON_TO_ARRAY()).

1
SELECT SUM(table_col::%qty)
2
	FROM t JOIN TABLE(JSON_TO_ARRAY(JSON_EXTRACT_JSON(
3
		JSON_INCLUDE_MASK(j_col,'{"lineitems":{"qty":1,"SKU":1}}'),
4
		"lineitems")))
5
	WHERE table_col::%SKU = 650000;

Conclusion

These JSON builtins offer very useful and efficient tools for interacting with JSON data. Their unique functionality is perfectly suited to be adapted for SingleStore-encoded Parquet. The end result is performance that mirrors that of any other top-level column, while retaining the core properties that make JSON extremely flexible and accessible.

https://www.singlestore.com/blog/winter-2022-universal-storage-part-5/
https://parquet.apache.org/
https://research.google/pubs/pub36632/
https://docs.singlestore.com/db/v8.0/en/reference/sql-reference/json-functions/json_extract_-type-.html
https://docs.singlestore.com/db/v8.0/en/reference/sql-reference/json-functions/json_match_any.html