Pipeline Data Consumption

We are using SingleStore Pipelines to consume data from Kafka and store it in SingleStore columnstore tables. The data in Kafka is JSON-structured and serialized as Avro. When this data is consumed through SingleStore Pipelines, the Avro data type is added as a wrapper object in the JSON structure. Is there a way to get rid of this data type field?

Sample data in Kafka:
{
  "troubleticket": {
    "additionalInfo": {
      "account": {"string": "1234"},
      "accountTeam": {"string": "7678"}
    }
  }
}

Data in SingleStore, consumed via SingleStore Pipelines:
{
  "troubleticket": {
    "additionalInfo": {
      "account": {"string": "1234"},
      "accountTeam": {"string": "7678"}
    }
  }
}

Data expected in SingleStore:
{
  "troubleticket": {
    "additionalInfo": {
      "account": "1234",
      "accountTeam": "7678"
    }
  }
}

SingleStore version 7.3

Are you loading the entire Avro record into a single JSON-typed column? (Posting the output of SHOW CREATE PIPELINE would help.)

The general rule is that we convert Avro union values to the Avro "JSON encoding" of the union, which wraps the value in an object naming the selected type. We do have a convenience feature for loading a single union-with-null field into a single column, but we don't currently apply it recursively when converting whole Avro records containing unions to JSON objects.
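For illustration, this wrapping comes from the Avro specification itself: a field declared as a union of null and string, e.g.

{"name": "account", "type": ["null", "string"]}

has the JSON encoding {"string": "1234"} when a string value is present, which is exactly the wrapper object shown in the samples above.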

The most straightforward workaround would be to extract the fields from the Avro record into @-variables and use the SET clause to construct the desired JSON object from them. You can see an example of variables and the SET clause in Example 3 here: LOAD DATA · SingleStore Documentation.
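As a rough sketch (the broker address, topic, table, and variable names here are placeholders, and the exact subvalue-path and JSON-function syntax should be checked against the docs for your version), the pipeline could pull the non-null union branches into @-variables and rebuild the unwrapped JSON in the SET clause:

-- Hypothetical target table with a single JSON column.
CREATE TABLE trouble_tickets (payload JSON);

CREATE OR REPLACE PIPELINE trouble_tickets_pipeline AS
LOAD DATA KAFKA 'kafka-broker:9092/troubleticket-topic'
INTO TABLE trouble_tickets
FORMAT AVRO
-- The trailing ::% selects the non-null branch of each union-with-null field.
( @account <- troubleticket::additionalInfo::account::%,
  @account_team <- troubleticket::additionalInfo::accountTeam::% )
SET payload = JSON_SET_STRING(
                JSON_SET_STRING('{"troubleticket":{"additionalInfo":{}}}',
                  'troubleticket', 'additionalInfo', 'account', @account),
                'troubleticket', 'additionalInfo', 'accountTeam', @account_team);

Here JSON_SET_STRING writes the two leaf values into a pre-shaped JSON skeleton; a plain CONCAT would also work if you prefer to build the string directly.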

If this is a self-managed (on-prem) deployment of SingleStore, you could also use the TRANSFORM clause of CREATE PIPELINE to do your own Avro → JSON conversion with a schema-registry-aware Avro decoder (e.g. confluent_kafka API — confluent-kafka 2.1.0 documentation) in a transform script.
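A minimal sketch of such a transform, assuming the Confluent wire format (magic byte plus 4-byte schema ID) on the message values and a schema registry at a placeholder URL; the 8-byte little-endian length prefix per message assumed here should be verified against the transform documentation for your version:

#!/usr/bin/env python3
# Hypothetical transform: decode schema-registry-framed Avro messages from
# stdin and emit one plain JSON document per line on stdout.
import json
import struct
import sys

from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import MessageField, SerializationContext

registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})  # placeholder URL
deserialize = AvroDeserializer(registry)

stdin, stdout = sys.stdin.buffer, sys.stdout.buffer

while True:
    # Assumed framing: each Kafka message is prefixed with its length as an
    # 8-byte little-endian integer.
    header = stdin.read(8)
    if len(header) < 8:
        break
    (length,) = struct.unpack("<q", header)
    payload = stdin.read(length)
    # The deserializer resolves the writer schema via the registry and returns
    # a plain dict; union values come back unwrapped, with no type name.
    record = deserialize(payload, SerializationContext("troubleticket-topic", MessageField.VALUE))
    if record is not None:
        stdout.write(json.dumps(record).encode("utf-8"))
        stdout.write(b"\n")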

Note that you should expect some performance overhead from our Avro record → JSON conversion, as well as from querying a single JSON-typed column in place of a more structured table schema. If you do wind up breaking the data out into separate columns, you should also find it easier to formulate the pipeline that way.
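For example (again with placeholder names, and the same assumed ::% path syntax as in the earlier sketch), a flattened version skips the JSON rebuilding entirely:

-- Hypothetical columnstore table with one column per field of interest.
CREATE TABLE trouble_tickets_flat (
  account VARCHAR(64),
  account_team VARCHAR(64),
  KEY (account) USING CLUSTERED COLUMNSTORE
);

CREATE OR REPLACE PIPELINE trouble_tickets_flat_pipeline AS
LOAD DATA KAFKA 'kafka-broker:9092/troubleticket-topic'
INTO TABLE trouble_tickets_flat
FORMAT AVRO
( account <- troubleticket::additionalInfo::account::%,
  account_team <- troubleticket::additionalInfo::accountTeam::% );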