GCS Pipeline does not read the data

moralez.rodrigo · October 29, 2020, 5:40pm

I gave all the permissions to the Google Cloud Storage service account, but I'm still getting:
Cannot extract data for pipeline. Object actual_object_name.csv.gz has a nil content length

It reads the file name, since the name shows up in the log, but nothing else. The files are complete.

CREATE PIPELINE pipeline_name
AS LOAD DATA GCS 'bucket_name'
CREDENTIALS '{"access_id": "XXXXXXX", "secret_key": "XXXXXXX"}'
BATCH_INTERVAL 200
SKIP ALL ERRORS
INTO TABLE XXXXXX
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n' STARTING BY ''
IGNORE 1 LINES;

The same pipeline, with the same source data, works fine for S3. Any help?

UPDATE: It works if I pick a file and manually remove Content-Encoding=gzip from its metadata in GCS, leaving the field blank.

Thanks in advance.
Rodrigo

moralez.rodrigo · November 2, 2020, 3:06pm

It still doesn't work. I tried several things. Because it works when I remove Content-Encoding=gzip from the file in GCP, I tried setting "disable_gunzip": true in the CONFIG. I also disabled the decompressive transcoding feature in GCP, but still no results.

Any help is appreciated. Thanks

yznovyak-ua · November 3, 2020, 7:30pm

Sorry to hear you're having issues here.

The only way you can be seeing this error is if GCS didn't return Content-Length when we asked to download your .csv.gz file. Can you describe your setup in a little more detail so that I can reproduce this error on my side? In particular, do you happen to have some custom configs for your bucket and/or objects, especially anything related to what Google describes in "Transcoding of gzip-compressed files" (Cloud Storage, Google Cloud)? I wonder if you're storing .csv files, Google compresses/decompresses them on the fly, and you're losing the Content-Length field…
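
To make the suspected failure mode concrete, here is a minimal Python sketch of that header check. It is not SingleStore's code: the function name diagnose_gcs_response and the returned messages are hypothetical, and the rule that an on-the-fly decompressed response arrives chunked and without Content-Length is an assumption based on the behavior described above.

```python
from typing import Mapping


def diagnose_gcs_response(headers: Mapping[str, str]) -> str:
    """Classify a download response by its headers (heuristic only)."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}

    has_length = "content-length" in lowered
    is_gzip = lowered.get("content-encoding", "").lower() == "gzip"

    if has_length and is_gzip:
        return "served as stored: gzip bytes with a known length"
    if has_length:
        return "served with a known length and no gzip encoding"
    # Assumption: a transcoded response is streamed, so no length is sent.
    if lowered.get("transfer-encoding", "").lower() == "chunked":
        return "no Content-Length, chunked body: possibly transcoded on the fly"
    return "no Content-Length: length unknown"
```

Feeding it the response headers of a failing object should show whether the "nil content length" error lines up with a missing Content-Length header.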

yznovyak-ua · November 3, 2020, 7:47pm

OK. I see you tried a couple of things. Can you please confirm that the pipeline is able to ingest a single file if both of the following are true:

associated object’s metadata DOES NOT have Content-Encoding: gzip
you are NOT using "disable_gunzip": true.

Meanwhile, I'll look into how to make the pipelines add Accept-Encoding: gzip so that GCS sends those files as-is with all the headers.
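
For anyone who wants to test the Accept-Encoding idea outside the pipeline, a rough Python sketch follows. It uses only the standard library; the helper names, the storage.googleapis.com/bucket/object URL form, and the optional bearer token are assumptions for illustration, not what the pipeline does internally.

```python
import urllib.request
from typing import Optional


def build_as_is_request(url: str, bearer_token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request that asks the server for the stored gzip bytes."""
    req = urllib.request.Request(url, method="GET")
    # Advertise gzip support so the server has no reason to decompress.
    req.add_header("Accept-Encoding", "gzip")
    if bearer_token is not None:
        # Hypothetical auth: only needed if the object is not public.
        req.add_header("Authorization", f"Bearer {bearer_token}")
    return req


def response_content_length(req: urllib.request.Request) -> Optional[str]:
    """Send the request and return the Content-Length header, if any."""
    # urllib does not decompress bodies, so the headers are seen as sent.
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("Content-Length")
```

Calling response_content_length(build_as_is_request(url)) once with the header and once with it removed should show whether GCS drops Content-Length only in the second case.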

moralez.rodrigo · November 3, 2020, 9:45pm

Thanks for your help, @yznovyak-ua. This still didn't solve the problem, but here is what you asked:

associated object's metadata DOES NOT have Content-Encoding: gzip.
This works if I manually remove gzip from that setting in GCS. But I have no control over that; the files are pushed by a provider.
you are NOT using "disable_gunzip": true.
I tried with and without it, and neither works.

I also tried Google's transcoding setting. It is enabled by default, but I disabled it on the GCS object using Cache-Control: no-transform, and it still didn't work.

I don’t know what else to check

yznovyak-ua · November 3, 2020, 10:00pm

Are you running self-managed MemSQL (SingleStore), or are you using Cloud One (Helios, aka SingleStore Managed Solution)?

Do you have any control over the files? Or are those provided by a third party?

moralez.rodrigo · November 3, 2020, 10:17pm

@yznovyak-ua I'm running self-managed MemSQL. I have no control over the files in GCS; those come from a third party.

moralez.rodrigo · November 3, 2020, 10:20pm

I mean, I can manually change some attributes, but not the way the files are uploaded into GCS.

moralez.rodrigo · November 9, 2020, 3:55pm

@yznovyak-ua could you provide some help with this? Is this a SingleStore bug?

yznovyak-ua · November 9, 2020, 4:31pm

@moralez.rodrigo Hey, sorry, I'm working on a workaround now. I hoped it would be easier, but so far I'm running into random issues with GCS (like it returning a Content-Length that is completely wrong).

moralez.rodrigo · November 15, 2020, 12:13pm

Hey @yznovyak-ua, I know these bugs are difficult :). But do you have any clues? Any information could help us resolve this. I can also help debug if needed.

yznovyak-ua · November 16, 2020, 1:54pm

@moralez.rodrigo Hey, sorry, I got sick last week (surprisingly not COVID), so I'm just now catching up on work. I'll try to get back to you later today (in ~3-4 hours). In the meantime, I have a weird workaround: would you consider using Google's Pub/Sub to copy data from the source you don't control to a GCS bucket you do control, set up in a way that removes the Content-Encoding: gzip?

moralez.rodrigo · November 16, 2020, 4:46pm

No, we do not have the control to do that, and gsutil does not allow modifying that header after the file is created. We are blocked.

yznovyak-ua · November 19, 2020, 10:34pm

@moralez.rodrigo Sorry for the late reply. I'm still a little under the weather here, so I've been slower than usual, but I've got help from a teammate (@pmishchenko). We tried a couple of ideas that didn't immediately work out, but I think we finally have a working solution. I'm 99% sure the fix is going to be released with the next minor release of MemSQL, but meanwhile let me figure out if we can do a dev build for you…

moralez.rodrigo · December 9, 2020, 12:56pm

@yznovyak-ua Thank you. Was this issue resolved in the recent release, SingleStore 7.3?
I would like to know so I can decide whether or not to upgrade.