Filesystem Pipeline Error Code 1933 Bug - pipeline continues to stat already 'Loaded' files

There appears to be a bug in error triggering/reporting for pipelines that use the filesystem as a source (and probably other “file based” sources): error 1933 is reported when it shouldn’t be.

Once a source file (FILE_NAME) has been loaded and appears in the information_schema.PIPELINES_FILES view as FILE_STATE = ‘Loaded’, the pipeline continues to stat the file, even though it will never attempt to load data from that file name again (at least not until you do an ALTER PIPELINE DROP FILE ‘’). This is different from the case of a 0 sized file, which the pipeline keeps stat’ing without attempting to load it until it becomes non-0 sized, at which point it attempts the load.

It seems to me (without running a strace) that the pipeline gets a directory listing and does a stat on each interesting/wanted file (matching any filename filter specified) in the directory listing. Normally this works out just fine, without any error. However, if you unlink the (already ‘Loaded’) file between the time that the directory listing is retrieved by the pipeline and the time that the pipeline does a stat on the file name, you get an ERROR_CODE = 1933 (ERROR_MESSAGE = ‘Cannot get source metadata for pipeline. stat /mnt/singlestorePipelines/pipelineTest3-1695194701.21536.ready: no such file or directory’) in the information_schema.PIPELINES_ERRORS view and the last batch is reported as having an error on the dashboard of SingleStore Studio. Despite this, and with @@pipelines_stop_on_error = 1, the pipeline (thankfully) continues to run.
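To illustrate the sequence I suspect, here is a minimal Python sketch of a "list, then stat" loop with a file removed in the gap between the two steps. It is only a simulation in a temporary directory, not SingleStore's actual code; the function names and the `between` hook are hypothetical.

```python
import os
import tempfile


def list_then_stat(directory, between=None):
    """List a directory, then stat each listed name.

    Returns the names whose stat failed because the file vanished.
    `between` is a hypothetical hook standing in for whatever another
    process does in the gap between the listing and the stat calls.
    """
    names = sorted(os.listdir(directory))  # step 1: directory listing
    if between is not None:
        between()                          # the race window
    missing = []
    for name in names:                     # step 2: stat each listed name
        try:
            os.stat(os.path.join(directory, name))
        except FileNotFoundError:          # "no such file or directory"
            missing.append(name)
    return missing


def demo():
    """Unlink a file inside the race window and report the failed stats."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "pipelineTest3-0.ready")
        with open(path, "w") as f:
            f.write("{}\n")
        return list_then_stat(d, between=lambda: os.unlink(path))
```

Calling `demo()` returns the name of the file that was listed but could no longer be stat'ed, which is the situation I believe is being recorded as error 1933.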

Steps to reproduce:

SingleStore DB v8.1.16 (self hosted)
SingleStore Studio v4.0.13
…installed on Ubuntu 22.04.1
…using NFS to mount the pipeline source directory from another host with NFS mount options: ro,noac,intr

CREATE OR REPLACE PIPELINE `pipelineTest3` AS
LOAD DATA FS '/mnt/singlestorePipelines/pipelineTest3-*.ready'
BATCH_INTERVAL 75 DISABLE OFFSETS METADATA GC
INTO PROCEDURE `pipeline_pipelineTest3`
(payloadData <- %)
FORMAT JSON;

START PIPELINE pipelineTest3;

Then (with the destination/into procedure in place)…

  • make a “pipelineTest3-*.ready” (where * is a unix timestamp with milliseconds) file, with valid data, appear once every second
  • check the information_schema.PIPELINES_FILES view every 10 seconds to see if the FILE_STATE of the created files is ‘Loaded’; if so, unlink the file
  • check the information_schema.PIPELINES_ERRORS view for errors
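A rough Python sketch of the producer and cleanup steps above follows. The helper names are hypothetical, the three-decimal timestamp format and the write-then-rename approach are my assumptions, and the query assumes PIPELINES_FILES has a PIPELINE_NAME column and that FILE_NAME holds a path that can be unlinked directly. Running the query needs a MySQL-compatible client, which is left out here.

```python
import os
import time

# Query to run every 10 seconds with any MySQL-compatible client.
# Assumption: PIPELINE_NAME is a column of PIPELINES_FILES.
LOADED_FILES_SQL = (
    "SELECT FILE_NAME FROM information_schema.PIPELINES_FILES "
    "WHERE PIPELINE_NAME = 'pipelineTest3' AND FILE_STATE = 'Loaded'"
)


def write_ready_file(directory, payload, now=None):
    """Create pipelineTest3-<timestamp>.ready containing `payload`.

    The data is written under a temporary name that does not match the
    pipeline's '*.ready' pattern, then renamed, so that a partially
    written file is never visible under the final name.
    """
    ts = time.time() if now is None else now
    final = os.path.join(directory, f"pipelineTest3-{ts:.3f}.ready")
    tmp = final + ".tmp"
    with open(tmp, "w") as f:
        f.write(payload)
    os.rename(tmp, final)  # atomic when both names are on one filesystem
    return final


def unlink_loaded(loaded_names):
    """Unlink every file reported as 'Loaded'; skip ones already gone."""
    removed = []
    for name in loaded_names:
        try:
            os.unlink(name)
            removed.append(name)
        except FileNotFoundError:
            pass
    return removed
```

In a real run, `write_ready_file` would be called once per second on the host that exports the directory, and `unlink_loaded` would be fed the FILE_NAME values returned by `LOADED_FILES_SQL`.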

Since the error is triggered by a race condition (i.e. unlinking the file between the time the directory listing is obtained by the pipeline and the time of the stat on the file), errors only occur intermittently. I’m seeing it happen about 1.2% of the time (916 errors for 76436 pipeline source files processed). The error rate will be proportional to the latency of the underlying filesystem, and to how often you check PIPELINES_FILES and unlink ‘Loaded’ files. In this case the filesystem is an NFS mount to an XFS filesystem on another local network host, with the DB and NFS host VMs running on separate VMware hosts. Also note that I’ve got the NFS client file attributes cache turned off with “noac”.

To rule out a race condition between the PIPELINES_FILES view and the pipeline actually being done with loading the file, I have tried delaying the unlinking of files until 600 seconds after they appear in the PIPELINES_FILES view as ‘Loaded’. I’ve also tried doing an “ALTER… DROP FILE…” immediately following the unlinking. Neither made any difference: the error still occurs.

Ideal resolution: don’t stat already loaded files; they’ll never be used again, and it just slows down the pipeline (NFS latency).

Minimum resolution: before triggering/recording an ERROR_CODE = 1933, check whether the FILE_NAME that failed the stat is already in the PIPELINES_FILES view.
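As a sketch of what that minimum check could look like (my own illustration in Python, not how SingleStore handles errors internally), the idea is to pull the path out of the stat error message and see whether it is already known as ‘Loaded’. The function name and the regular expression are hypothetical and are based only on the error message format shown earlier in this post.

```python
import re

# Matches the message format seen in PIPELINES_ERRORS for this case.
STAT_MISSING = re.compile(r"stat (?P<path>.+): no such file or directory")


def is_spurious_1933(error_code, error_message, loaded_files):
    """Return True if a 1933 error refers to a file that is already 'Loaded'.

    `loaded_files` is a set of FILE_NAME values taken from
    information_schema.PIPELINES_FILES where FILE_STATE = 'Loaded'.
    """
    if error_code != 1933:
        return False
    match = STAT_MISSING.search(error_message)
    return bool(match) and match.group("path") in loaded_files
```

With this kind of check, a missing file that was already loaded would be ignored, while a missing file that has never been loaded, or any other metadata failure, would still be reported.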

I trust this is enough to reproduce, but I’m happy to provide more assistance if needed. If this isn’t a bug and I’m missing something, then there’s a deficiency in the documentation: the only reference to error 1933 that I can find is for Kafka pipelines having a connection error (https://docs.singlestore.com/cloud/load-data/load-data-from-a-data-source/load-data-from-kafka/troubleshooting-kafka-aws-pipelines/).

I expect that this error may also occur erroneously with other “file based” pipeline sources when ‘Loaded’ files are removed (or maybe not; I’m not sure how getting their directory listings relates to stat’ing their files), as well as correctly when there’s actually an underlying source connection problem.

Regards,

Daryl