Invalid JSON value for column when loading via a pipeline into a stored procedure

Hi

I’m trying to load data via a pipeline. I used to do this without a stored procedure, but now I need to load through one. However, the pipeline now fails, claiming I have invalid JSON. Previously it didn’t complain, but it silently removed the characters not supported by the character set (not the desired behavior, but I didn’t notice at the time).

I’m getting the following error:
Invalid JSON value for column 'search_entry'

My sproc is defined as:

delimiter //

create or replace procedure search_loader_proc(batch query(client_id bigint, search_entry json collate utf8mb4_general_ci))
as
begin
    -- At some point we'll do something here, but now it just fails immediately
end //

delimiter ;

The pipeline is defined as:

create pipeline search_loader_pl as
load data
s3 '{...}'
config '{...}'
batch_interval 5000
into procedure search_loader_proc
format json
(
    `search_entry` <- `search` default null,
    `client_id` <- `client_id`
);

And the file that fails contains:

{"client_id":188502,"search":{"id":6860607798000749872,"term":"718134926477854cn","timestamp":1550870738000}}
{"client_id":188502,"search":{"id":6680607797497681070,"term":"dlx640481151nl","timestamp":1550871640000}}
{"client_id":188502,"search":{"id":8660607795855117842,"term":"dlm641035354cn","timestamp":1550872279000}}
{"client_id":188502,"search":{"id":6660807796649241451,"term":"4en030179682jp","timestamp":1550872489000}}
{"client_id":188502,"search":{"id":6660807796704819018,"term":"😊cx002632257no","timestamp":1550872563000}}
{"client_id":188502,"search":{"id":6660607798304807551,"term":"jlf484089544cn","timestamp":1550872697000}}
{"client_id":188502,"search":{"id":6660607795206686679,"term":"xrs331252679lv","timestamp":1550869311000}}

The specific line that fails is:

{"client_id":188502,"search":{"id":6660807796704819018,"term":"😊cx002632257no","timestamp":1550872563000}}

So I’m assuming it’s the emoji, but I’ve defined the search_entry parameter with the utf8mb4_general_ci collation. Any guidance on how to overcome this?

2 Likes

Could you please share more details on the issue and the version of SingleStore you are running?
I did a couple of experiments with your data and it seems to work for me:

memsql> CREATE TABLE t(id BIGINT, term VARCHAR(20) collate utf8mb4_general_ci);
Query OK, 0 rows affected (1.322 sec)

memsql> CREATE PIPELINE test AS LOAD DATA FS 'example.json' INTO TABLE `t`(id <- search::id, term <- search::term) FORMAT JSON;
Query OK, 0 rows affected (0.176 sec)

memsql> test pipeline test;
+---------------------+-------------------+
| id                  | term              |
+---------------------+-------------------+
| 6860607798000749872 | 718134926477854cn |
| 6680607797497681070 | dlx640481151nl    |
| 8660607795855117842 | dlm641035354cn    |
| 6660807796649241451 | 4en030179682jp    |
| 6660807796704819018 | 😊cx002632257no    |
+---------------------+-------------------+
5 rows in set (0.189 sec)

The same applies to stored procedure:

memsql> CREATE TABLE t(id BIGINT, term VARCHAR(20) collate utf8mb4_general_ci);
Query OK, 0 rows affected (1.322 sec)

memsql> DELIMITER //
memsql> CREATE OR REPLACE PROCEDURE search_loader_proc(batch query(client_id BIGINT, search_entry TINYTEXT COLLATE utf8mb4_general_ci))
    -> AS
    -> BEGIN
    ->     INSERT INTO `t` SELECT client_id, search_entry FROM batch;
    -> END //
Query OK, 1 row affected (0.009 sec)

memsql> DELIMITER ;

memsql> CREATE OR REPLACE PIPELINE tp AS LOAD DATA FS 'example.json' INTO PROCEDURE `search_loader_proc`(client_id <- search::id, search_entry <- search::term) FORMAT JSON;
Query OK, 0 rows affected (0.050 sec)

memsql> TEST PIPELINE tp;
+---------------------+-------------------+
| client_id           | search_entry      |
+---------------------+-------------------+
| 6860607798000749872 | 718134926477854cn |
| 6680607797497681070 | dlx640481151nl    |
| 8660607795855117842 | dlm641035354cn    |
| 6660807796649241451 | 4en030179682jp    |
| 6660807796704819018 | 😊cx002632257no    |
+---------------------+-------------------+
5 rows in set (0.080 sec)

Could you please try the examples I posted and let me know whether they work for you, along with details about your SingleStore Engine version and table structure?

Thank you!

2 Likes

Hi Tigran

Thank you for looking into this.
We’re running SingleStore DB version 7.8.14, as part of SingleStoreDB Cloud.

The examples you provided don’t cover the case I’m trying to solve.
Your examples use flat JSON, not nested JSON.
Your JSON schema looks like:

{
  "client_id": number,
  "search_entry": string
}

where the JSON schema of the data I have to process is:

{
  "client_id": number,
  "search": {
    "id": number,
    "term": string,
    "timestamp": number
  }
}

Currently it’s not possible to flatten the JSON, as I’m not in charge of the input data.

1 Like

Ok, I got your point. Is this the error you are facing? This looks like a bug to me and I am able to reproduce this.

memsql> CREATE TABLE t(id BIGINT, term VARCHAR(20) collate utf8mb4_general_ci);
Query OK, 0 rows affected (0.165 sec)

memsql> DELIMITER //
memsql> CREATE OR REPLACE PROCEDURE search_loader_proc(batch query(client_id BIGINT, search_entry JSON COLLATE utf8mb4_general_ci)) AS
    -> BEGIN
    ->     RETURN;
    -> END //
Query OK, 1 row affected (0.004 sec)

memsql> DELIMITER ;
memsql> CREATE OR REPLACE PIPELINE tp AS LOAD DATA FS 'example.json' INTO PROCEDURE `search_loader_proc`(client_id <- search::id, search_entry <- search::term) FORMAT JSON;
Query OK, 0 rows affected (0.027 sec)

memsql> TEST PIPELINE tp;
ERROR 1844 (HY000): Leaf Error (localhost:3306): Leaf Error (127.0.0.1:3306): Invalid JSON value for column 'search_entry'
memsql> 

I will create a ticket to investigate this more closely; meanwhile, let’s try to find a workaround. You don’t need to modify the input data in order to flatten the JSON; you can do that in the pipeline configuration, similar to the example in my previous post. For instance, instead of mapping search to a JSON argument in the stored procedure, you can map the text and integer elements one by one, like this:

DELIMITER //
CREATE OR REPLACE PROCEDURE search_loader_proc(batch QUERY(
    client_id BIGINT,
    id BIGINT,
    term TINYTEXT COLLATE utf8mb4_general_ci,
    ts BIGINT)) AS
BEGIN
    RETURN;
END //
DELIMITER ;

CREATE PIPELINE test AS
    LOAD DATA FS 'example.json'
    INTO PROCEDURE `search_loader_proc`(
        client_id <- client_id,
        id <- search::id,
        term <- search::term,
        ts <- search::timestamp)
    FORMAT JSON;

Please let me know if that works as a workaround.

1 Like

Glad that you can actually reproduce it, and yes, that is the error I’m facing. I also really appreciate that you’ll investigate this further :blush:

Unfortunately, that workaround is not really feasible. To simplify the problem I removed a lot of extra parameters from the example. What we’re essentially trying to do is have a single pipeline insert into many tables based on which parameters are populated. So, for instance, if the search_entry parameter is not null, we include those rows when inserting into our “search” table. Currently we have up to 15 parameters being populated. Each of these parameters corresponds to another JSON object that has between 5 and 20 properties. Flattening all of this would result in hundreds of parameters, and checking for availability would then require checking many parameters.

The procedure below illustrates what we’re trying to achieve. In many cases we would have 15 JSON parameters instead of just the 3 used in the example.

delimiter //

create or replace procedure loader_proc(batch query(client_id bigint, type1 json collate utf8mb4_general_ci, type2 json collate utf8mb4_general_ci, type3 json collate utf8mb4_general_ci))
as
begin
	insert into `type1_test` (client_id, ...)
	     select b.client_id, b.`type1`::%`some_property`, b.`type1`::`some_other_property` , /* more values */
	       from batch b
	      where b.type1 is not null;

	insert into `type2_test` (client_id, ...)
	     select b.client_id, b.`type2`::%`some_property`, b.`type2`::%`some_other_property` , /* more values */
	       from batch b
	      where b.type2 is not null;

	insert into `type3_test` (client_id, ...)
	     select b.client_id, b.`type3`::`some_property`, b.`type3`::%`some_other_property`, /* more values */
	       from batch b
	      where b.type3 is not null;

	insert into `type3_test2` (client_id, ...)
	     select b.client_id, b.`type3`::`some_property`, b.`type3`::%`some_other_property`, /* more values */
	       from batch b
	      where b.type3 is not null;
end //

delimiter ;

The data we’re reading is the JSON serialization of a Protobuf message. The protobuf for the above would be:

message Data {
	int64 client_id = 1;
	oneof type {
		Type1 type1 = 2;
		Type2 type2 = 3;
		Type3 type3 = 4;
	}
}

message Type1 {
	int64 some_property = 1;
	string some_other_property = 2;
	/* more properties */
}

message Type2 {
	int64 some_property = 1;
	int64 some_other_property = 2;
	/* more properties */
}

message Type3 {
	string some_property = 1;
	int64 some_other_property = 2;
	/* more properties */
}

I’m not sure if you’re familiar with protobuf and the oneof syntax: it means that only one of the properties wrapped in the oneof block will be populated. Hence we check for null in the stored procedure to know whether the data is of that particular type.
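
To make this concrete, the pipeline mapping for the oneof case would look roughly like the sketch below (paths elided and names illustrative), using default null so that the absent oneof members arrive as null in the batch:

create or replace pipeline loader_pl as
load data
s3 '{...}'
config '{...}'
into procedure loader_proc
format json
(
    `client_id` <- `client_id`,
    `type1` <- `type1` default null,
    `type2` <- `type2` default null,
    `type3` <- `type3` default null
);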

I appreciate that you’ve tried to come up with a workaround though!

1 Like

I tried to resolve it in a different way and eventually found a small issue in my example above: I was binding a string to a JSON argument, which was the main issue, not the emoji. That brings us back to the point where I can’t reproduce this issue, either on the latest version or on 7.8.14.

memsql> DELIMITER //
memsql> CREATE OR REPLACE PROCEDURE search_loader_proc(batch query(client_id BIGINT, search_entry JSON COLLATE utf8mb4_general_ci)) AS
    -> BEGIN
    ->     RETURN;
    -> END //
Query OK, 1 row affected (0.009 sec)

memsql> DELIMITER ;
memsql> CREATE OR REPLACE PIPELINE tp AS LOAD DATA S3 's3://tigran-test/example.json' INTO PROCEDURE `search_loader_proc`(client_id <- search::id, search_entry <- search) FORMAT JSON;
Query OK, 0 rows affected (1.594 sec)

memsql> TEST PIPELINE tp;
+---------------------+---------------------------------------------------------------------------------+
| client_id           | search_entry                                                                    |
+---------------------+---------------------------------------------------------------------------------+
| 6860607798000749872 | {"id":6860607798000749872,"term":"718134926477854cn","timestamp":1550870738000} |
| 6680607797497681070 | {"id":6680607797497681070,"term":"dlx640481151nl","timestamp":1550871640000}    |
| 8660607795855117842 | {"id":8660607795855117842,"term":"dlm641035354cn","timestamp":1550872279000}    |
| 6660807796649241451 | {"id":6660807796649241451,"term":"4en030179682jp","timestamp":1550872489000}    |
| 6660807796704819018 | {"id":6660807796704819018,"term":"😊cx002632257no","timestamp":1550872563000}    |
+---------------------+---------------------------------------------------------------------------------+
5 rows in set (1.225 sec)

Could you please try this example on your side and let me know if it works for you? If it fails, could you please share the error message and the results of the following query:

show variables like '%collation%';

Also, could you please upload your file with the example data as-is instead of pasting the contents? BTW, s3://tigran-test/example.json has public access, so you can use this S3 path for your tests.

1 Like

I have submitted a support ticket with a full file that errors; perhaps you can look up the data there. I can’t share the full data publicly.

I can’t seem to upload a file here, so I’ll just post another example inline.

{"client_id":493998,"search":{"id":6770607796435965704,"term":"id181g77854cn","timestamp":1550870730000}}
{"client_id":493998,"search":{"id":6660607798000779872,"term":"34d54cnd","timestamp":1550870738000}}
{"client_id":493998,"search":{"id":6660607777497681070,"term":"gdfdd51ndl","timestamp":1550871640000}}
{"client_id":493998,"search":{"id":6667707795855117842,"term":"lmddddn","timestamp":1550872279000}}
{"client_id":493998,"search":{"id":6660777796649241451,"term":"e45555p","timestamp":1550872489000}}
{"client_id":493998,"search":{"id":6660607777704819018,"term":"􏰜􏰌􏰆􏰇􏰎􏰌􏰇􏰇􏰚􏰛􏰝􏰍􏰆􏰁􏰞􏰁􏰌􏰜􏰌􏰆􏰇􏰎􏰌􏰇􏰇􏰚􏰛􏰝􏰍􏰆􏰁􏰞􏰁􏰌yydfd2257do","timestamp":1550872563000}}
{"client_id":493998,"search":{"id":6770607796304807551,"term":"lfdf544ckn","timestamp":1550872697000}}
{"client_id":493998,"search":{"id":6660607795207766679,"term":"rys33aaaa679lv","timestamp":1550869311000}}
{"client_id":493998,"search":{"id":6660607777409492686,"term":"466ddd2","timestamp":1550869750000}}
{"client_id":493998,"search":{"id":6660607796422562177,"term":"143aad0","timestamp":1550869770000}}

I tried the example you posted and it didn’t throw any errors.

These are the configured collations:

show variables like '%collation%';

+-------------------------------+-----------------+
| Variable_name                 | Value           |
+-------------------------------+-----------------+
| collation_connection          | utf8_general_ci |
| collation_database            | utf8_general_ci |
| collation_server              | utf8_general_ci |
| json_extract_string_collation | auto            |
+-------------------------------+-----------------+

However, we shouldn’t have to change these, as we’re explicitly defining the collation in the procedure. Well, at least according to the official documentation :sweat_smile:

1 Like

Yeah, collation variables shouldn’t affect the result in this case, but since the example I sent worked for you I assume the problem is inside the file. Let me check it and come back to you.

UPD

I uploaded the file from the support ticket as is to my S3 bucket and tried the same pipeline. It still works for me. Could you please test it? This pipeline is pretty much the same as the one you mentioned in the question. I am under the impression there’s something wrong with your particular file. Did you check whether it contains a BOM?

DELIMITER //
CREATE OR REPLACE PROCEDURE search_loader_proc(batch query(client_id BIGINT, search_entry JSON COLLATE utf8mb4_general_ci)) AS
BEGIN
    RETURN;
END//
DELIMITER ;
CREATE OR REPLACE PIPELINE tp AS
    LOAD DATA S3 's3://tigran-test/1550620800000-1550966399000.json'
    BATCH_INTERVAL 5000
    INTO PROCEDURE `search_loader_proc`(
        client_id <- search::id,
        search_entry <- search DEFAULT NULL)
    FORMAT JSON;
TEST PIPELINE tp;

Note: this file is not public on S3, so you will not be able to test it yourself from my bucket.

1 Like

We got it. I reproduced this locally on my machine with the real data. Let me check what’s going on with those characters and come back to you. Thanks for being patient.

1 Like

As far as I can tell, none of our files contains a BOM.

$ file 1550620800000-1550966399000.json
1550620800000-1550966399000.json: New Line Delimited JSON text data

When you’re running TEST PIPELINE, how many rows does it fetch? The erroneous line in particular is at line 35888 and as far as I remember, testing a pipeline only fetches about 1000 rows or so.

Also, when testing the pipeline, I was led to believe that it only tests the fetching and potential transformation of the data, and doesn’t actually call the procedure or insert into the table. At least, I’ve had to actually start the pipeline to have it fail.
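
For what it’s worth, I assume something along these lines would push more data through and actually exercise the procedure (just a sketch, I haven’t verified the exact syntax against our cluster):

-- fetch more rows than the default sample
test pipeline search_loader_pl limit 50000;

-- run a single batch in the foreground so errors surface directly in the client
start pipeline search_loader_pl foreground limit 1 batches;

-- inspect any recorded errors afterwards
select * from information_schema.pipelines_errors where pipeline_name = 'search_loader_pl';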

1 Like

Oh! Glad you were able to reproduce! :grinning_face_with_smiling_eyes:
I’ll try and be patient :blush:

1 Like

I checked that string inside the JSON and it seems to me that the string was corrupted.
Here is the binary representation of the suspicious string:

00000000: f48f b09c f48f b08c f48f b086 f48f b087  ................
00000010: f48f b08e f48f b08c f48f b087 f48f b087  ................
00000020: f48f b09a f48f b09b f48f b09d f48f b08d  ................
00000030: f48f b086 f48f b081 f48f b09e f48f b081  ................
00000040: f48f b08c f48f b09c f48f b08c f48f b086  ................
00000050: f48f b087 f48f b08e f48f b08c f48f b087  ................
00000060: f48f b087 f48f b09a f48f b09b f48f b09d  ................
00000070: f48f b08d f48f b086 f48f b081 f48f b09e  ................
00000080: f48f b081 f48f b08c 0a                   .........

I checked the complete UTF-8 character table here (Complete Character List for UTF-8):

<Plane 16 Private Use, First> (U+100000) f4808080
<Plane 16 Private Use, Last>  (U+10FFFD) f48fbfbd

Looks like the characters you have in your string are in the private use range: decoding the first sequence, F4 8F B0 9C, gives code point U+10FC1C, which falls inside the Plane 16 Private Use range (U+100000 to U+10FFFD). Unfortunately, characters within the private use range are not supported by the SingleStore engine, and we have a ticket for that. It’s certainly not an emoji, though; is that something you expected?

2 Likes

In this case, the “terms” we get are search inputs made by our visitors, so we’re not in control of what they enter. We’re not in charge of performing the searches; we simply collect the inputs that have been used for searching, and hence we don’t restrict, transform or in any way alter the input. It could be that somebody intentionally entered some very obscure content (or, as you hint, that the data was corrupted somewhere).

But this is a very nice find, and I’m glad to hear that it’s expected from your end that these characters are not supported. So now the question is whether I can do something to work around it. Is there a function in SingleStore that will strip all unsupported characters?

Thanks a lot for helping to look into this!

1 Like

I got it. There are multiple ways of resolving this that I can suggest. The easiest one would be to simply convert the value to hex.

CREATE OR REPLACE PIPELINE tp AS
    LOAD DATA FS 'example.json'
    INTO PROCEDURE `search_loader_proc`(
        client_id <- search::id,
        @search <- search)
    FORMAT JSON
    SET search_entry_hex = HEX(@search);

It’s not perfect, since we can’t really use the JSON as JSON. The other way is a little bit hacky: we can use a regular expression to remove the term from the JSON, and then extract it and store it as hex or base64 separately.

DELIMITER //
CREATE OR REPLACE PROCEDURE search_loader_proc(batch query(client_id BIGINT, term TINYTEXT, search_entry JSON COLLATE utf8mb4_general_ci)) AS
BEGIN
    RETURN;
END //
DELIMITER ;

CREATE OR REPLACE PIPELINE tp AS
    LOAD DATA FS 'example.json'
    INTO PROCEDURE `search_loader_proc`(
        client_id <- search::id,
        @search_term <- search::term,
        @search <- search)
    FORMAT JSON
    SET search_entry = regexp_replace(@search, '"term":"(.*)",', ''),
        term = HEX(@search_term);
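
Inside the procedure you could then decode the hex back with UNHEX() when inserting. Note that the target column would need to be binary (VARBINARY/BLOB) to hold the private use bytes; as a rough sketch, with a made-up table name:

-- `search_test` is only an illustration; UNHEX() restores the original bytes,
-- so its `term` column should be VARBINARY/BLOB rather than utf8mb4 text
INSERT INTO `search_test` (client_id, term, search_entry)
    SELECT client_id, UNHEX(term), search_entry
      FROM batch;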

You can also try using a pipeline with a transform and strip these characters before they hit SingleStore.

Please let me know if any of these work for you.

1 Like

We don’t wish to store it as hex, and we probably don’t want to use a transform either, as I expect it’ll add a noticeable amount of time to our processing. We’re aiming at handling a couple of billion rows in as short a time as possible :sweat_smile:

I’ll keep these workarounds in mind.

Perhaps I can sneak in a little follow-up question. We’ve seen cases where upserting into a table with a json column is 4-5 times slower than upserting into an identical table with the same column defined as longtext. The longtext column seemingly still supports the various JSON functions, so for now we’re using that. But I am wondering what could cause the upserts to become significantly slower when using json as the type.
Upserting with the column defined as json took more than 25 ms per entry (sometimes > 100 ms), while upserting with the column defined as longtext averaged 5-7 ms per entry.
I know this is sort of off topic, but maybe you know of some quirky behavior that we need to be aware of :grinning_face_with_smiling_eyes:
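
For context, a stripped-down version of the kind of comparison I mean (table and column names here are placeholders, not our actual schema):

-- identical rowstore tables, differing only in the payload column type
create rowstore table entries_json (id bigint primary key, payload json);
create rowstore table entries_text (id bigint primary key, payload longtext);

-- the same upsert against each; the json variant was consistently slower for us
insert into entries_json values (1, '{"term":"abc"}')
    on duplicate key update payload = values(payload);
insert into entries_text values (1, '{"term":"abc"}')
    on duplicate key update payload = values(payload);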

Once again, thanks a lot for all your help! I really appreciate it!

1 Like

I understand. The other way is to wait for the fix, but it may take some time until the fix is delivered to you, even if we merge it today. The performance of the second option shouldn’t be bad if it resolves the problem. I will prioritize a ticket to extend our utf8mb4 support to the private use areas.

Regarding your follow-up question: overall, JSON is slower than a regular string because it involves the additional steps of validation and parsing. I can’t tell off the top of my head whether the performance you mentioned is expected (though 4-5 times does sound significant), but we can check your query to make sure it’s as efficient as possible. It probably deserves a separate thread with an example.

1 Like

I will reach out to the folks generating the data we have to parse; maybe they can do something about the entries with characters in the private use range. I’ll also keep an eye on the coming releases to see if a fix is included.

Regarding the second option, it might work, but again we’d need to do quite a few regex replaces, as it’s not just this “term” column that can contain this sort of data :slightly_frowning_face: I’m somewhat worried about performance, but we can test it out so we have some actual metrics to base our decision on.
We can’t use transforms as we’re running SingleStoreDB Cloud (at least according to the documentation).

Well, it makes sense if the nodes try to parse the JSON. Again, I’m just glad there’s some sense to it. The documentation states that json is essentially just a longtext column, which is why I was a bit confused :grinning_face_with_smiling_eyes: If we need the column to be defined as json and need to have the query optimized, I might reach out again (or reach out to Dario Bernardo, who’s been helping us before).

Thanks again for taking the time to look into this :muscle:

1 Like