File system pipeline into SP for multiple tables

I currently have a File System Pipeline which uses a stored procedure to upsert into a table. After the upsert is done, I delete the old entries in the table that are not present in the new file, based on a timestamp set at the beginning of the SP. I am also inserting a few columns into another table.
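
Roughly, the SP looks like this (a simplified sketch; the procedure, table, and column names here are made up):

```sql
DELIMITER //
CREATE OR REPLACE PROCEDURE load_batch(batch QUERY(id BIGINT, val TEXT)) AS
DECLARE
  batch_start DATETIME;
BEGIN
  -- Timestamp taken at the beginning of the SP; rows not touched by
  -- this batch keep an older updated_at and count as stale.
  batch_start = NOW();

  -- 1. Upsert the new file's rows into the main table.
  INSERT INTO main_table (id, val, updated_at)
    SELECT id, val, batch_start FROM batch
    ON DUPLICATE KEY UPDATE val = VALUES(val), updated_at = batch_start;

  -- 2. Delete old entries that are not present in the new file.
  DELETE FROM main_table WHERE updated_at < batch_start;

  -- 3. Insert a few columns into another table.
  INSERT INTO other_table (id, val)
    SELECT id, val FROM batch;
END //
DELIMITER ;

CREATE PIPELINE main_pipeline
AS LOAD DATA FS '/data/incoming/*.csv'
INTO PROCEDURE load_batch
FIELDS TERMINATED BY ',';
```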

  1. What happens if the pipeline fails after the upsert succeeds but before the delete executes? Would only the upsert be visible in MemSQL?

  2. If I insert values into another table after the upsert into the first table, what happens if the pipeline fails between the writes to table 1 and table 2?

I guess my basic question is whether the whole SP is considered a single unit of work, or whether we need to handle these failure scenarios ourselves.

By default, all queries in a pipeline’s stored procedure run in the context of an implicit multi-statement transaction managed by MemSQL. We manage it for the sake of exactly-once semantics: it’s the same transaction we use to update the pipeline’s internal metadata about which files are loaded vs. unloaded. Writes in the stored procedure won’t become visible outside the SP until the whole batch completes successfully and we commit the transaction. If any part of the SP fails, all of its writes will be rolled back and the file will remain in an unloaded state.

So yes, a single unit of work with no need for manual cleanup.
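
To make that concrete, here’s a sketch (same made-up names as above) of what happens when a later statement in the batch fails:

```sql
CREATE OR REPLACE PROCEDURE load_batch(batch QUERY(id BIGINT, val TEXT)) AS
DECLARE
  bad_batch EXCEPTION;  -- user-defined exception, just to force a failure
BEGIN
  INSERT INTO main_table (id, val)       -- step 1: succeeds
    SELECT id, val FROM batch
    ON DUPLICATE KEY UPDATE val = VALUES(val);

  DELETE FROM main_table WHERE id < 0;   -- step 2: succeeds

  RAISE bad_batch;                       -- step 3: fails

  INSERT INTO other_table (id, val)      -- never reached
    SELECT id, val FROM batch;
END
```

After this batch fails, the effects of steps 1 and 2 are rolled back along with everything else, and the file should show up as Unloaded again in information_schema.PIPELINES_FILES, so the pipeline will retry it.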

Sasha,

Thanks for the response. Just a quick follow-up.

What data is returned to queries during the pipeline stored procedure’s execution? Since you mentioned that the changes are committed only after the entire batch succeeds, is the updated data available to queries only after the pipeline finishes?

Or is the new data visible to queries while the SP runs, but rolled back if the pipeline fails?

Apologies for the delayed response.

Queries issued by a given invocation of the stored procedure will see the effects of earlier writes in that same invocation, but those writes will be invisible to all other queries until the invocation finishes and implicitly commits or rolls back.

So the exact answer to your second question depends on which queries you mean. For queries inside that SP invocation: yes, the updated data is visible to them even though it’s not committed (that’s by design: any further effects “derived from” seeing the updates will roll back if the updates roll back). Otherwise: no, queries outside the SP or in other concurrent invocations won’t see them. No matter who sees what, all rows written by an SP that ultimately throws an exception will get rolled back and cleaned up automatically.
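
As a sketch (made-up names again), inside one invocation:

```sql
CREATE OR REPLACE PROCEDURE load_batch(batch QUERY(id BIGINT, val TEXT)) AS
DECLARE
  n BIGINT;
BEGIN
  INSERT INTO main_table (id, val)
    SELECT id, val FROM batch
    ON DUPLICATE KEY UPDATE val = VALUES(val);

  -- This read runs inside the same not-yet-committed transaction,
  -- so it already sees the rows upserted above...
  n = SCALAR(SELECT COUNT(*) FROM main_table);

  -- ...and anything derived from them rolls back with them on failure.
  INSERT INTO batch_stats (loaded_at, row_count) VALUES (NOW(), n);
END
```

Meanwhile, a `SELECT COUNT(*) FROM main_table` from any other connection keeps returning the pre-batch count until the invocation commits.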

The particular implementation is multi-version concurrency control (MVCC), if that helps you model it. (I saw the email you sent with more details.)

Sasha,

Thanks for the reply. This confirms that the updates are visible only to the pipeline stored procedure making the upsert, until the pipeline is finished.

Another follow-up question: if updates come in from another channel (like Kafka, for real-time updates) while this pipeline stored procedure is upserting data for the same row, how would those updates be handled?

1. Will the Kafka update wait for the pipeline update to finish and then be applied?

2. Or will it be applied to the data that is currently visible to the outside world? If that’s the case, it’s possible the real-time update would be overwritten by the pipeline update.

The Kafka update will indeed wait for the pipeline update to finish before applying its change. We use per-row locks, taken only by writes and held until commit time, mainly (I believe) to ensure that our semantics are option 1 here rather than option 2. So a second write to the same row will wait, but reads of that row, or writes to a different row, won’t.
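
Sketching the interleaving (made-up names; lock_wait_timeout is the relevant setting if the wait runs long):

```sql
-- Session A: the pipeline SP's transaction, still open (uncommitted):
UPDATE main_table SET val = 'from_pipeline' WHERE id = 42;  -- takes the row lock

-- Session B: the Kafka consumer, before A commits:
UPDATE main_table SET val = 'from_kafka' WHERE id = 42;
--   ^ blocks until A commits or rolls back, then applies on top of
--     A's result (or errors out if it waits past lock_wait_timeout)

SELECT val FROM main_table WHERE id = 42;  -- doesn't block: MVCC serves
--   the last committed version, i.e. the pre-pipeline value

UPDATE main_table SET val = 'other' WHERE id = 7;  -- different row: doesn't block
```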
