Native BSON support with SingleStore Kai™

SingleStore Kai™ is now GA on top of SingleStore 8.5!

SingleStore Kai is an endpoint you can enable on a SingleStore instance that enables MongoDB® client applications to use SingleStore. For GA, we deeply integrated the BSON data type with the engine. This enabled us to better optimize for query patterns used by MongoDB client applications, achieving better compatibility and performance.

We'll be filling in more documentation on the new functionality supporting SingleStore Kai in the near future. For now, let's spend a bit of time on the BSON data type itself.

BSON vs. JSON

BSON is the native document format of MongoDB and MongoDB client applications. It's a binary format similar in use to JSON, and is documented here.

If you write your application against MongoDB client drivers, you use the BSON data type without even needing to think about it much — but, it's a big part of what makes the MongoDB ecosystem work. Some key differences between JSON and BSON are:

BSON supports additional leaf data types, such as binary and datetime
BSON's maps are ordered
BSON's encoding is binary; it can't be directly rendered as text

If you read the specification, you'll see evidence that this format evolved over time and has some deprecations, unusual aspects and rough edges. But it has some nice properties, too:

Fields are usually length-prefixed to enable quick seeking through a document
The native support for binary and datetimes is useful for many types of client applications. This data needs to be serialized when stored as JSON
It's easier to parse BSON correctly versus JSON because the format is less ambiguous

A minor extension

One of the rough edges with the BSON type is that the specification makes no allowance for storing a value (like a string or integer) at the "top level." This was an awkward limitation for SingleStore because SingleStore allows storing JSON values at the top level (like 1 or true) and in general, we want BSON to function similarly to SingleStore's JSON data type.

To enable top-level values, we made use of the fact that all BSON documents per the specification always end in a null byte (\0). In SingleStore, any BSON buffer that ends in a null byte is a BSON document conformant to the specification.

However, if the BSON buffer ends in a non-null byte, it is to be interpreted as a BSON type code that describes the format of the preceding buffer, to be read as a value of that type per the BSON specification.

It's important to say that when using SingleStore Kai, the client application will never need to handle this format extension; it's fully compatible with MongoDB client drivers. But when used directly through SQL, it is possible to observe this as shown in the section on using BSON.

Using BSON

Tables created with SingleStore Kai use the BSON format by default. For example:

1
mytestdb>db.foo.insertOne({_id:1,a:1})
2
{ acknowledged: true, insertedId: 1 }
3

4
mytestdb>db.runCommand({sql:"SHOW CREATE TABLE mytestdb.foo"}).cursor.firstBatch[0]["Create Table"]
5
CREATE TABLE `foo` (
6
  `_id` bson NOT NULL,
7
  `_more` bson NOT NULL COMMENT 'KAI_MORE' ,
8
  `$_id` as BSON_NORMALIZE_NO_ARRAY(`_id`) PERSISTED longblob COMMENT 'KAI_AUTO' ,
9
  SHARD KEY `__SHARDKEY` (`$_id`),
10
  UNIQUE KEY `__PRIMARY` (`$_id`) USING HASH,
11
  SORT KEY `__UNORDERED` ()
12
) COMMENT 'KAI_CID:Q01mEgTSFgRXBItN'

The columns in this table have the following purposes:

_id	bson NOT NULL	The _id field which is required in all MongoDB documents.
_more	bson NOT NULL	All other fields.
$_id	BSON_NORMALIZE_NO_ARRAY(`_id`) PERSISTED longblob	Normalized version of _id - see below.

When data is inserted into this table through SingleStore Kai, the proxy distributes it across the columns by matching fields in the document to column names. Any field that does not match a column name gets added to _more. For example, after inserting the document {_id:"abc", a:1} the data stored is this:

1
mytestdb>db.runCommand({sql:"SELECT HEX(_id), HEX(_more) FROM mytestdb.foo"})
2
{ cursor:
3
   { firstBatch:
4
      [ { 'HEX(_id)': '040000006162630002',
5
          'HEX(_more)': '0C0000001061000100000000' } ],
6
     id: new Long("0"),
7
     ns: 'admin.$cmd.sql' },
8
  ok: 1 }

You can see that the final byte in _id is 0x2, the code for a BSON string. The _id value is an int32 length of 4, followed by the UTF-8 encoded payload and a null terminator. This is an example of the SingleStore BSON extension for top-level values mentioned earlier.

The final byte in _more is 0x0, indicating a regular BSON document. The length is 0xC and holds a single element, the {a:1}.

What's with $_id and normalization?

Another rough edge of BSON is how MongoDB compares BSON values, which has per-type logic. Most notably, all the numeric types are cross-comparable — BSON documents {a:0} where the 0 is either int32, int64, double or decimal128 all compare as equal, even though their encodings are different. This gets really weird for decimal128, which is a base-10 floating point instead of base-2 like regular IEEE-754 doubles — and the decimal128 format itself allows lots of different representations of the same number.

The functions BSON_NORMALIZE and BSON_NORMALIZE_NO_ARRAY transform BSON into a byte stream that preserves all the expected BSON comparison outcomes when compared byte-by-byte. This is used to implement BSON unique keys using normal SingleStore unique indexes.

A quick look under the hood

We've worked to optimize the entire lifecycle of BSON data in SingleStore.

When BSON is inserted into SingleStore, it's validated and stored using the same serialization in which the client sent it. This makes the initial insertion fast, as it's little more than a few buffer copies.

With ROWSTORE tables, it ends there — data is kept in memory in BSON format, ready to send back to the client immediately. This sort of table can be used to build predictably low-latency applications.

With tables using Universal Storage (the default table type), once enough data is loaded (typically a million rows, but this is configurable), the data is analyzed to determine its schema and encoded into a columnar format. Storing the data in this format enables fast analytics and can also often lead to a high degree of compression; for example, we can insert many rows into a table:

1
for(let i = 0; i < 10000; ++i) { 
2
  var bar = db.bar.initializeUnorderedBulkOp(); 
3
  for(let j = 0; j < 1000; ++j) {
4
    bar.insert({j:j,a:"hello world"}) 
5
  } 
6
  bar.execute(); 
7
}

After loading 2 million rows, with the simple example above the "_more" column is compressed at a >96% ratio over the raw BSON encoding, and takes up only around 4 MB. On real data sets, significant compression typically continues to apply. Not only does this economize storage usage, it also accelerates queries over the data set — especially when combined with the way column store execution only needs to load the fields involved in a query from storage, and leaves the rest untouched.

A note on reconstruction

When documents are read, they have to be reconstructed from the columnar data. This path is highly optimized, but if the documents have a large number of columns, it can involve a lot of IO and increase retrieval latency. If you want the lower-latency retrieval of ROWSTORE combined with the accelerated analytics of column store, you can create a column group index on the table:

1
db.createCollection("zzz",{columnGroup:true})

Now when rows are retrieved, they will be copied straight from the column group index and will not need to be reconstructed, which in some query patterns can lead to lower and more predictable latencies.

Just the beginning

There's more to say about the BSON type, which we'll cover later, and we'll be continuing to enhance it to support more use cases. We can't wait to see what customers do with it!

SingleStore Kai™ is generally available and integrated into SingleStore Helios cloud — it’s free to use (with our new Free Shared Tier), easy to set up during workspace creation, quick to migrate data and easy to run simple queries and commands directly in the cloud portal.

We would love to learn more about your use case. Feel free to reach out to our PM team and start a conversation on our Forums. Ready to try it out yourself? Activate your free SingleStore trial and receive $600 in cloud credits!