Index for simple arrays (int or string) or JsonArrays?

Do I understand correctly that SingleStore does not currently support any kind of index on a simple array (int or string) or JSON array?

We often have data like

tags = ["home", "office", "worldwide"]
tag_ids = [3839, 38383, 37838]
color = ['red', 'blue', 'black']
status = [1, 3, 2, 10, 47]
etc.

We often use these fields as filters. Indices would be very useful for this.

Unlike the examples in your JSON index documentation (which rely on computed columns), this data has no other properties from which a unique, indexable field could be derived.

I have not found a way to create a computed column from an array and then give it an index. It works with single JSON fields, but not with a JSON array. If there were a function like ARRAY_TO_STRING(separator), you could create a computed column from the JSON array and then give it a full-text index. But I have not found such a function.

What is the best way to create an index for these simple arrays (or json-arrays) that are very often used as filters?

What I had in mind was creating a persisted computed column and just casting the json to text, like this:

create table r(j json, t as j persisted text, fulltext(t));
insert r values('["red","white","blue"]');
insert r values('["green","red"]');
optimize table r full; /* do this or wait long enough for flusher to update fulltext index */
select * from r where match(t) against('blue');
+------------------------+------------------------+
| j                      | t                      |
+------------------------+------------------------+
| ["red","white","blue"] | ["red","white","blue"] |
+------------------------+------------------------+

singlestore> select * from r where match(t) against('red');
+------------------------+------------------------+
| j                      | t                      |
+------------------------+------------------------+
| ["red","white","blue"] | ["red","white","blue"] |
| ["green","red"]        | ["green","red"]        |
+------------------------+------------------------+

If your text has blanks then you may need to convert them to underscores or something.
You might have to write an expression in the computed column definition to do that.
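For example, here is a minimal sketch of that idea, assuming REPLACE() accepts the JSON value implicitly cast to text (the table name r2 and the sample values are made up for illustration):

create table r2(j json, t as replace(j, ' ', '_') persisted text, fulltext(t));
insert r2 values('["new york","red"]');
optimize table r2 full; /* or wait for the flusher to update the fulltext index */
select * from r2 where match(t) against('new_york');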

Another way to handle this is to normalize your lists into a separate table, join it to the original table on the original table’s primary key, and put an index on the value column of that separate table. It’s most efficient to use an int or bigint surrogate key. E.g.,

create table s(id bigint, d datetime, shard(id));
create table s_set(s_id bigint, value text, shard(s_id), key(value));

singlestore> insert s values(1,now());
Query OK, 1 row affected (0.22 sec)

singlestore> insert s values(2,now());
Query OK, 1 row affected (0.00 sec)

singlestore> insert s_set values(1,"red");
Query OK, 1 row affected (0.21 sec)

singlestore> insert s_set values(1,"green");
Query OK, 1 row affected (0.00 sec)

singlestore> select * from s join s_set on s.id = s_set.s_id where s_set.value = "green";
+------+---------------------+------+-------+
| id   | d                   | s_id | value |
+------+---------------------+------+-------+
|    1 | 2023-10-18 22:50:43 |    1 | green |
+------+---------------------+------+-------+

If you need to reconstruct the sets you can join the found IDs back to s_set.
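For instance, a sketch of that reconstruction using the tables above (the alias names are just for illustration):

select s.id, s.d, all_vals.value
from s
join s_set filt on filt.s_id = s.id and filt.value = "green"
join s_set all_vals on all_vals.s_id = s.id;
/* returns every value in the sets of the rows whose set contains "green" */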


Thanks for the quick reply. I have only been working with SingleStore for a few days. My reference to an "ARRAY_TO_STRING" function was about converting the JSON array to a string, but apparently you don't have to do that at all; I imagined it being more complicated than it is. I know the second approach, but it requires creating additional tables, so we refrained from doing that.

I have now managed to create a "computed column" for these fields, and it works well. However, it seems that you have to limit yourself to strings in the JSON data; integer data in the JSON unfortunately does not work with the full-text index:

insert r values([969820431,965662313]);
select * from r where match(t) against('965662313');
Empty set (0,00 sec)

Setting full-text aside, would a Column Group index speed up "json_match_any"? Example:

create table r(j json, t as j persisted LONGTEXT, fulltext(t), COLUMN GROUP `cg_full` (*));
insert r values('["red","white","blue"]');
insert r values('["green","red"]');
optimize table r full;
SELECT * FROM r WHERE json_match_any(MATCH_PARAM_STRING_STRICT()='red', j);

Is the Column Group Index also used in the following query?
SELECT * FROM r WHERE JSON_ARRAY_CONTAINS_STRING(j, 'red');

P.S. Interestingly, "json_match_any" and "JSON_ARRAY_CONTAINS_STRING" also work here on the "computed column", even though it is declared as LONGTEXT:

SELECT * FROM r WHERE json_match_any(MATCH_PARAM_STRING_STRICT()='red', t);
SELECT * FROM r WHERE JSON_ARRAY_CONTAINS_STRING(t, 'red');

Thank you 🙂

You should only use COLUMN GROUP for very wide tables, say more than 40 columns.

Your
insert r values([969820431,965662313]);
is missing quotation marks, so I’m not sure what that would do.

Also, you need to wait a while for the index to update, or do ‘optimize table full’ first to make it update the numbers.

If you still are not finding the numbers after that, comment back here.

Sorry I meant of course:

insert r values('[969820431,965662313]')

A JSON array with integer values rather than string values. These are not found by the full-text index:

singlestore> select * from r;
+------------------------+------------------------+
| j                      | t                      |
+------------------------+------------------------+
| [969820431,965662313]  | [969820431,965662313]  |
| ["red","white","blue"] | ["red","white","blue"] |
| ["green","red"]        | ["green","red"]        |
+------------------------+------------------------+

select * from r where match(t) against('965662313');
Empty set (0,00 sec)

Of course I did the “optimize table full”.

What happens if the table has fewer than 40 columns and has a COLUMN GROUP index? Does COLUMN GROUP have any disadvantages? Does it make things slower or consume unnecessary resources? This is not clear from the documentation.

A column group takes additional space and takes time to update when you update the table. So unless the benefit of improved seek time is significant, you may not want to use a column group. It often works great for seeks when you have more than 40 columns.

See here for more details.


A full-text index would be larger than an array index. It'd be great if SingleStore supported a multi-valued index.

Postgres goes a few steps further and supports GIN indexes on JSON objects, so one can do things like "field" @> '[{"key": "value"}]' to check whether the {key: value} object is inside the JSON field "field".
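For reference, a minimal Postgres sketch of that pattern (the table and column names are made up for illustration):

create table items (id bigint primary key, field jsonb);
create index items_field_gin on items using gin (field);
-- containment: does the jsonb array in "field" contain the object {"key": "value"}?
select * from items where field @> '[{"key": "value"}]';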

I have a Postgres project I am considering moving to SingleStore for compression (RocksDB) and bottomless support, and I was trying to avoid denormalizing the data if I could. But it might have to be done anyway…

@tehlike – Noted. I opened an internal feature request for this. We’ll keep an eye on this area.
