FileSystem Pipeline Giving Error

I am exploring pipelines for our application environment and created a basic pipeline as given on the SingleStore documentation site:

CREATE PIPELINE library
AS LOAD DATA FS '/app/inbound/books.txt'
INTO TABLE classic_books
FIELDS TERMINATED BY ',';

The file is accessible to the master aggregator (MA) and has all the required permissions. The pipeline is created successfully, but it returns an error when I run the START PIPELINE command:

SQL Error [1934] [HY000]: (conn=16239147) Leaf Error (10.XX.XX.XXX:3306): Leaf Error (10.XX.XX.XXX:3306): Cannot extract data for pipeline. open /app/inbound/books.txt: no such file or directory.

I have two questions:

  1. How do I resolve this issue? Does the file need to be available on each leaf node as well?
  2. This one is about design. Are pipelines recommended for a scenario where we receive around 250 batch files daily (all with different schemas and formats) and have to load them into their respective database tables (250 tables, one for each file)?
    Note: These 250 files are part of a batch load and have to be loaded once a day during the nightly batch load process. We might have to apply some transformations to some of the fields coming in the files.
    Also, these files are a combination of fixed-width files and pipe-delimited files.

Thanks

  1. Yes, as written, the files must be available to all leaves. Pipelines are in general designed around having multiple files available at once and will by default seek to do all work in parallel across SingleStore partitions. And FS pipelines in particular are designed around NFS mounts visible to all leaves.

For that exact query, I’d recommend using LOAD DATA. It’s designed for single-file loads. I describe some advantages below. You could also add the AGGREGATOR keyword to that CREATE PIPELINE, and in that case making it accessible to the MA will suffice. In an AGGREGATOR pipeline, the MA will handle downloading and parsing the file, forwarding parsed rows to leaves for final insert in parallel.
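
For reference, a minimal sketch of that variant, reusing the table and path from your post; the only change is the AGGREGATOR keyword:

CREATE AGGREGATOR PIPELINE library
AS LOAD DATA FS '/app/inbound/books.txt'
INTO TABLE classic_books
FIELDS TERMINATED BY ',';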

  2. If making those files available to the leaves isn’t in the cards and you don’t need pipelines-only features like script transforms or stored procedure integration, I’d also recommend LOAD DATA for the batch load you describe. It’s currently optimized to push a bit more work down to the leaves compared to AGGREGATOR pipelines; it lets you at least manually parallelize aggregator work by connecting to different child aggregators; and (with LOCAL INFILE) it lets you stream files over a SQL connection rather than copying them to disk on SingleStore hosts. Setting up NFS and using non-aggregator pipelines would likely perform best of all, but with 250 files a day it’s likely not worth the effort.
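
As a rough sketch of the LOCAL INFILE route (the file path and table name here are placeholders, and your client must have LOCAL INFILE enabled, e.g. mysql --local-infile):

LOAD DATA LOCAL INFILE '/app/inbound/orders.txt'
INTO TABLE orders
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n';

Because this streams the file over the client connection, it only needs to exist on the machine you run the client from, not on any SingleStore host.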

Note also that (1) you can perform SQL transformations on incoming data in the SET clause of LOAD DATA, which will hopefully suffice here, and (2) you’ll want a separate LOAD DATA query (or pipeline, if you go that way) for each destination table.
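
To make (1) concrete for your two file shapes, here are rough sketches; the tables, columns, offsets, and date format below are hypothetical. For pipe-delimited files you can bind raw fields to user variables and transform them in SET; for fixed-width files, a common trick is to read each whole line into a single variable and slice it with SUBSTRING:

-- Pipe-delimited: bind raw fields to variables, transform in SET.
LOAD DATA LOCAL INFILE '/app/inbound/customers.txt'
INTO TABLE customers
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
(id, @name, @signup_date)
SET name = UPPER(@name),
    signup_date = STR_TO_DATE(@signup_date, '%m/%d/%Y');

-- Fixed-width: read each whole line into one variable, then slice it.
-- Assumes the data contains no tab characters (the default field
-- delimiter), so each line lands in @line intact.
LOAD DATA LOCAL INFILE '/app/inbound/accounts.txt'
INTO TABLE accounts
LINES TERMINATED BY '\n'
(@line)
SET account_no = TRIM(SUBSTRING(@line, 1, 10)),
    balance    = TRIM(SUBSTRING(@line, 11, 12));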
