DATE format support when loading data from a PARQUET file located in S3

I am loading data from S3 (Parquet file format) into MemSQL using the pipeline below. The data loads successfully, but the date column comes back blank. If I do the same with a CSV file, I get the data. I have checked my Parquet file and it contains the dates properly. Please guide me on how to handle date columns through a pipeline when the file is in Parquet format.

CREATE OR REPLACE PIPELINE rjsv
AS LOAD DATA S3 'mybucket/'
CONFIG '{"region": "us-east-2"}'
CREDENTIALS '{"aws_access_key_id": "AKIAXJE6", "aws_secret_access_key": "nbH/HPSA0PW1SHvB59"}'
INTO TABLE emp
(empno <- empno,
 ename <- ename,
 salary <- salary,
 hiredate <- hiredate)
FORMAT PARQUET;

The most likely reason for this is that we don’t currently convert Parquet fields with time-related logical type annotations (such as TIME_MICROS) to a format compatible with time-related SQL types automatically.

There’s a workaround, but it depends on which logical type you have exactly. What is the schema of your parquet file?

Alternatively, what value do you see for a sample row if you ingest one of your files into a table where the date column has type longblob instead? And what time do you expect to see for that row?

Thank you very much for your prompt response. Please provide the workarounds, and I will choose the one that best suits my case.

The workaround is to use the SET clause to manually transform integral fields to datetime values:

MySQL [x_db]> create table t(time_col datetime(6), unconverted longblob);
Query OK, 0 rows affected (0.134 sec)

MySQL [x_db]> create pipeline p as load data fs "/tmp/sanity.parquet"
    -> into table t (@var <- time_field, unconverted <- time_field)
    -> format parquet
    -> set time_col = timestampadd(MICROSECOND, @var, from_unixtime(0));
Query OK, 0 rows affected (0.095 sec)

MySQL [x_db]> test pipeline p;
+------------------+----------------------------+
| unconverted      | time_col                   |
+------------------+----------------------------+
| 1579446791000000 | 2020-01-19 10:13:11.000000 |
+------------------+----------------------------+

I assumed above that the Parquet data has microsecond precision and is stored in UTC, which corresponds to the TIME_MICROS logical type. The precision or time zone of your data may differ. For example, for millisecond-precision time stored in the local time zone, you’ll want
timestampadd(MICROSECOND, @var * 1000, '1970-01-01') instead.
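As a sanity check on the arithmetic, here is a small Python sketch (not part of the pipeline) that mirrors the microsecond conversion. One caveat: from_unixtime(0) in the SQL above is evaluated in the session time zone, so the client may display a shifted value (the 10:13:11 row above is presumably the UTC instant 15:13:11 rendered in a UTC-5 session); this sketch anchors at the UTC epoch instead, so its output is the unshifted UTC value.

```python
from datetime import datetime, timedelta, timezone

def micros_to_datetime(us: int) -> datetime:
    # Equivalent of timestampadd(MICROSECOND, @var, from_unixtime(0)),
    # except anchored at the UTC epoch rather than the session-local epoch.
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=us)

# The sample value from the test pipeline output above:
print(micros_to_datetime(1579446791000000))  # 2020-01-19 15:13:11+00:00
```

For millisecond-precision data you would multiply by 1000 first, matching the @var * 1000 variant above.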

We’re working on doing this conversion automatically in an upcoming 7.0 dot release.