S3 CSV pipeline randomly fails with code 416 but then reads the same files correctly - reason?

A pipeline with an S3 source reads files in CSV format and randomly fails with the message
"Cannot extract data for pipeline. InvalidRange: The requested range is not satisfiable
status code: 416, request id: …, host id: …"

We can see in the attached screenshot that these files do get loaded, but not on the first try.

We can also see that the number of rows reported on the first try is usually wrong, and can even be greater than the number of rows in the file.

What could be the reason, and how should we interpret this error message?

Thanks!

Thank you for your question.

Are you using Cloud or Self-Hosted? Also which version are you using please?

Thank you.

Hi achaudhri, I use Self-Hosted, version 7.8.1 (@@version 5.7.32).

Thank you.

Could you provide an example of your CREATE PIPELINE statement please?

Hi achaudhri,
This is what I can show :slight_smile:

CREATE OR REPLACE PROCEDURE p_xxx(batch QUERY(fields list))…

CREATE OR REPLACE PIPELINE s3_xxx AS LOAD DATA
LINK DATA_S3 'xxx/yyy/*.csv.gz' INTO PROCEDURE p_xxx
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;

ALTER PIPELINE s3_xxx SET BATCH_INTERVAL 1800000;

Thank you. I have forwarded to support at SingleStore.

Here is what we think is happening:

Our pipelines assume that once files are visible in S3, their contents will not change. It looks like the CSV files here may be concurrently modified/reuploaded. That error (and the different row counts between failed vs successful batches) suggests that the size & contents of a file changed while the pipeline was in the process of fetching the CSV. The batch_partition_parsed_rows that we report is the exact number of rows that we found as we parsed the file, up to the point where we hit the error.
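To make the failure mode concrete, here is a minimal sketch (all names and sizes are illustrative, not SingleStore internals) of why re-uploading a smaller file mid-fetch produces HTTP 416. The extractor records the object's size when it first lists the file and then fetches it in byte ranges; if the object shrinks before a later range is requested, that range starts past the new end of the object, which per the HTTP spec is not satisfiable.

```python
def range_is_satisfiable(object_size: int, start: int) -> bool:
    """A byte range is satisfiable only if its start offset is inside the object."""
    return start < object_size


def fetch_statuses(size_at_listing: int, current_size: int, chunk: int):
    """Simulate ranged GETs planned from the size seen at listing time,
    executed against an object whose size has since changed.
    Returns (offset, status) pairs: 206 Partial Content or 416 InvalidRange."""
    statuses = []
    for start in range(0, size_at_listing, chunk):
        status = 206 if range_is_satisfiable(current_size, start) else 416
        statuses.append((start, status))
    return statuses


# File was 10 MB when listed, but re-uploaded at 4 MB while being fetched
# in 5 MB chunks: the first range succeeds, the second one hits 416.
print(fetch_statuses(10_000_000, 4_000_000, 5_000_000))
```

This also matches the row-count symptom: rows parsed from the ranges that did succeed get counted before the failing range aborts the batch, so the reported count reflects a mix of old and new file contents.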

Thanks, that's a good lead. I will check it with the attribution platform that writes these files to S3.