Pipeline DDL and Dynamic Filename as input

Hi Team,
Whenever we need to add a new column or modify the column in the pipeline, we have to

  1. Open the PIPELINE DDL, do the required changes
  2. Drop the pipeline or replace the pipeline with create or replace command
  3. We are currently hard coding for PIPELINE DATASURCE FILE with either pattern or specific name, instead of it is there any alternate way to pass the pipeline data source file name as parameter to invoke the pipeline.
    The issue is whenever we do drop and recreate pipeline, it picks up all the files with the File Data source pattern that we are specifying, it is corrupting the data to the max extent and becoming difficult to trace out.
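(For reference, a sketch of the shape of the command we run in step 2; the pipeline name, bucket path, CONFIG/CREDENTIALS values, and target table are placeholders:)

CREATE OR REPLACE PIPELINE mypipeline AS
LOAD DATA S3 'test-bucket/mypath*'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "<key>", "aws_secret_access_key": "<secret>"}'
INTO TABLE mytable FIELDS TERMINATED BY ',';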

We exercise utmost caution by archiving previously processed files, but in some instances residue files remain in the data source path; when we recreate the pipeline and then trigger the job, it picks up all the files in that path.

I would need help on:

  1. Do we have a mechanism to pass the data source file name dynamically, as a parameter?
  2. Or do we have a way to just alter the pipeline, without dropping and recreating it, when we need to modify some elements of the pipeline DDL?
  3. Is there any metadata we can query (or some other mechanism) to see the pipeline and filename info without opening the pipeline DDL?

I know that once we run the pipeline, we can get the processed file names by querying the PIPELINES_FILES metadata table, for example:
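(A sketch; 'mydb' is a placeholder for the database name:)

SELECT DATABASE_NAME, PIPELINE_NAME, FILE_NAME, FILE_STATE
FROM information_schema.PIPELINES_FILES
WHERE DATABASE_NAME = 'mydb';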

We want to ensure there is a one-to-one mapping between the pipeline and the data source filename.

I would really appreciate a solution to the above points.

With regards,
Prathap Reddy G

Hi Prathap,
Thanks for the detailed explanation. I believe our ALTER PIPELINE functionality can help solve the issue you're currently facing, as there is no need to drop and recreate pipelines. You could, for example, SET OFFSETS to ignore files that have already been loaded; a sketch follows below. There are multiple information schema tables containing pipeline metadata; they can be found in our pipelines docs. Please take a look at the docs and let me know if you have additional questions.
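As a sketch (assuming a pipeline named mypipeline; a pipeline must be stopped before it can be altered, and SET OFFSETS LATEST is one way to skip files currently visible in the source so that only newly arriving files are loaded):

STOP PIPELINE mypipeline;
ALTER PIPELINE mypipeline SET OFFSETS LATEST;  -- skip everything currently sitting in the source path
START PIPELINE mypipeline;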

Hello Prathap,
To get the current pipeline definition (which includes the file path/pattern), you may want to try the SHOW CREATE PIPELINE SQL command, for example:
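(Assuming a pipeline named mypipeline:)

SHOW CREATE PIPELINE mypipeline;

The source configuration is also queryable from the information schema, e.g. something along these lines:

SELECT PIPELINE_NAME, CONFIG_JSON
FROM information_schema.PIPELINES;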

Regards.
mkheir

That's not quite our question. The issue is that whenever we create the pipeline after dropping it, it picks up all the files available in the pipeline DDL file path. How can we control that?

This won't help resolve our issue, and in Prod we don't have the permissions to make changes as we wish.

In your CREATE PIPELINE, you could add logic such as a WHERE clause to filter rows. Additionally, you could point the pipeline INTO a stored procedure that checks the target table for existing records before inserting, as shown below.
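A rough sketch of the stored-procedure approach (the pipeline, procedure, table, and column names below, plus the CONFIG/CREDENTIALS values, are hypothetical placeholders):

-- Target table with a key we can check for duplicates
CREATE TABLE target_table (id BIGINT, payload TEXT);

-- Procedure that only inserts rows whose id is not already present
CREATE OR REPLACE PROCEDURE dedup_insert(batch QUERY(id BIGINT, payload TEXT))
AS
BEGIN
  INSERT INTO target_table (id, payload)
  SELECT b.id, b.payload
  FROM batch b
  WHERE NOT EXISTS (SELECT 1 FROM target_table t WHERE t.id = b.id);
END;

-- The pipeline hands each batch to the procedure instead of writing straight to the table
CREATE PIPELINE dedup_pipeline
AS LOAD DATA S3 'test-bucket/mypath*'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "<key>", "aws_secret_access_key": "<secret>"}'
INTO PROCEDURE dedup_insert
FIELDS TERMINATED BY ',';

With something like this in place, rerunning the pipeline over residue files should not double-insert rows that already exist in target_table.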

For the pipeline source, you have a couple of options (the CONFIG/CREDENTIALS values and target table below are placeholders).

ex1, a single specific file, which gives you the one-to-one mapping between pipeline and data source file:

CREATE PIPELINE testing1 AS
LOAD DATA S3 'test-bucket/mypath/file.csv'
CONFIG '{"region": "us-east-1"}' CREDENTIALS '{"aws_access_key_id": "<key>", "aws_secret_access_key": "<secret>"}'
INTO TABLE mytable FIELDS TERMINATED BY ',';

ex2, every file matching the pattern:

CREATE PIPELINE testing2 AS
LOAD DATA S3 'test-bucket/mypath*'
CONFIG '{"region": "us-east-1"}' CREDENTIALS '{"aws_access_key_id": "<key>", "aws_secret_access_key": "<secret>"}'
INTO TABLE mytable FIELDS TERMINATED BY ',';