GCS Pipeline does not read the data

I granted all the permissions to the service account in Google Cloud Storage, but I'm still getting:

Cannot extract data for pipeline. Object actual_object_name.csv.gz has a nil content length

It reads the file name, since it appears in the log, but nothing else. The files themselves are complete.

CREATE PIPELINE pipeline_name
AS LOAD DATA GCS 'bucket_name'
CREDENTIALS '{"access_id": "XXXXXXX", "secret_key": "XXXXXXX"}'

The same pipeline, with the same source data, works fine for S3. Any help?

UPDATE: It works if I pick a file and manually remove Content-Encoding: gzip from its metadata in GCS, leaving it blank.

Thanks in advance.


Still, it doesn't work. I tried several things. I tried using "disable_gunzip": true in the CONFIG, because I saw that it works when I remove Content-Encoding: gzip from the file in GCS. I also disabled the decompressive transcoding feature in GCS, but still no results.

Any help is appreciated. Thanks

Sorry to hear you're having issues here.

The only way you can be seeing this error is if GCS didn't return Content-Length when we asked to download your .csv.gz file. Can you describe your setup in a little more detail so that I can reproduce this error on my side? In particular, do you happen to have any custom configs for your bucket and/or objects? Especially something with https://cloud.google.com/storage/docs/transcoding ? I wonder if you're storing .csv files and Google compresses/decompresses them on the fly, so you're losing the Content-Length field…
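To illustrate why on-the-fly (decompressive) transcoding would lose Content-Length: the size GCS knows up front is the stored, compressed object's size, not the size of the decompressed stream it ends up sending. A minimal stdlib sketch of that size mismatch (the CSV payload is made up for illustration):

```python
import gzip

# Hypothetical CSV payload standing in for actual_object_name.csv.gz.
csv_bytes = b"id,value\n" + b"".join(b"%d,%d\n" % (i, i * i) for i in range(1000))

# What GCS actually stores: the gzip-compressed bytes.
stored = gzip.compress(csv_bytes)

# The stored object's Content-Length is the compressed size; the
# decompressed stream sent under transcoding has a different size,
# which the server cannot know without decompressing first.
print(len(stored), len(csv_bytes))           # compressed vs. decompressed size
assert len(stored) != len(csv_bytes)
assert gzip.decompress(stored) == csv_bytes  # round-trip sanity check
```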

OK. I see you tried a couple of things. Can you please confirm that the pipeline is able to ingest a single file when the following is true:

  • associated object’s metadata DOES NOT have Content-Encoding: gzip
  • you are NOT using "disable_gunzip": true.

Meanwhile I'll look into how to make the pipelines add Accept-Encoding: gzip, so that GCS sends those files as-is with all the headers.
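For reference, the header change being described, sketched with Python's urllib (the bucket and object names are placeholders): declaring that the client accepts gzip should make GCS skip decompressive transcoding and serve the object as stored, with its real Content-Length.

```python
import urllib.request

# Placeholder URL; substitute your real bucket and object names.
url = "https://storage.googleapis.com/bucket_name/actual_object_name.csv.gz"

# Accept-Encoding: gzip tells the server we can handle the compressed
# bytes, so the stored object is returned unchanged, headers intact.
# (Request construction only; no network call is made here.)
req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
print(req.get_header("Accept-encoding"))  # the header set on the request
```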

Thanks for your help, @yznovyak-ua. It still didn't solve the problem, but here is what you asked:

  • associated object’s metadata DOES NOT have Content-Encoding: gzip.
    This works if I manually remove the gzip value from that metadata in GCS, but I don't have control over that; the files are pushed by a provider.

  • you are NOT using "disable_gunzip": true.
    I tried with and without it, and nothing works.

I also tried with Google's transcoding. It is enabled by default, but I disabled it on the GCS object using Cache-Control: no-transform, and it still didn't work.

I don't know what else to check.

Are you running self-managed MemSQL (SingleStore), or are you using Cloud One (Helios, aka the SingleStore Managed Solution)?

Do you have any control over the files? Or are those provided by a 3rd party?

@yznovyak-ua I'm running self-managed MemSQL. I have no control over the files in GCS; those are from a 3rd party.

I mean, I can manually change some attributes, but not the way the files are uploaded to GCS.

@yznovyak-ua Could you provide some help with this? Is this a SingleStore bug?

@moralez.rodrigo Hey, sorry – I'm working on a workaround now. I hoped it would be easier, but so far I'm running into random issues with GCS (like it returning a Content-Length that is completely wrong).

Hey @yznovyak-ua, I know these bugs are difficult :) but do you have any clue? Any information could help us resolve this. I can also help debug if needed.

@moralez.rodrigo Hey – sorry, I got sick last week (surprisingly not COVID), so I'm just now catching up with work. I'll try to get back to you today (in ~3-4 hours). In the meantime I have a weird workaround: would you consider using Google's Pub/Sub to copy data from the source you don't control to a GCS bucket you do control, set up in a way that removes the Content-Encoding: gzip?
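In outline, the transformation that copy step would perform: fetch the compressed bytes, gunzip them, and write the plain CSV to the controlled bucket with no Content-Encoding metadata. This stdlib sketch shows only the byte-level part; the GCS download/upload/trigger wiring is omitted, and the object body is made up:

```python
import gzip

def strip_gzip(payload: bytes) -> bytes:
    """Turn a stored .csv.gz payload into plain CSV bytes, so the copy
    can be re-uploaded to the controlled bucket with no Content-Encoding."""
    return gzip.decompress(payload)

# Made-up object body standing in for the provider's upload.
source_object = gzip.compress(b"id,value\n1,10\n2,20\n")
plain = strip_gzip(source_object)
print(plain.decode())  # plain CSV, ready to re-upload without gzip metadata
```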

No :frowning: we don't have the access to do that, and gsutil does not allow modifying that header after the file is created. We are blocked.

@moralez.rodrigo Sorry for the late reply – I'm still a little under the weather, so I've been slower than usual, but I've got help from a teammate (@pmishchenko). We tried a couple of ideas that didn't immediately work out, but I think we finally have a working solution. I'm 99% sure the fix is going to be released with the next minor release of MemSQL, but meanwhile let me figure out if we can do a dev build for you…