How do I perform delta load from flat files

Hello,
We receive about 30 different large column-delimited data files every day from another system. These files are loaded into a MemSQL database, which takes a couple of hours to finish. Many of the files have only about 5% changes compared to the previous day.

  1. How do I upload only the updated/new rows into the database? We'd like to load each row only once. Currently we use a pipeline into a stored procedure and upsert the entire data set, which takes a long time to finish. I'm not sure whether the source DB supports row-level checksums.

Thanks,
Sid.

Can you do a regular upsert into a columnstore in 7.5, using multi-column key support? It's not clear why you're using a pipeline into a stored procedure.

I’m not sure what you mean about row-level checksum. Can you clarify?

Here’s the doc reference “Performing Upserts”: SingleStoreDB Cloud · SingleStore Documentation

Maybe I can achieve this by adding a timestamp column with "ON UPDATE current_timestamp()".
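A minimal sketch of that idea, assuming MySQL-style TIMESTAMP defaults (which SingleStore supports) and a hypothetical table `dest_table` with a single-column primary key:

```sql
-- Hypothetical destination table: last_modified is set on insert and
-- refreshed automatically whenever the row is updated, so it can later
-- drive a "recently modified rows only" filter.
CREATE TABLE dest_table (
  id BIGINT NOT NULL,
  payload VARCHAR(255),
  last_modified TIMESTAMP NOT NULL
      DEFAULT CURRENT_TIMESTAMP()
      ON UPDATE CURRENT_TIMESTAMP(),
  PRIMARY KEY (id)
);
```

Note this only helps if the *source* system can stamp rows the same way; a timestamp maintained on the destination side says when you loaded a row, not when the source changed it.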

CREATE PIPELINE supports a REPLACE clause, which gives semantics equivalent to REPLACE INTO dest_table SELECT * FROM input; an ON DUPLICATE KEY UPDATE clause, for INSERT INTO dest_table ... ON DUPLICATE KEY UPDATE semantics; and a WHERE clause to filter incoming rows before the insert. So I'd recommend two things:

  • Avoid 95% of insert-time key lookups in the destination table by using the WHERE clause to filter out incoming rows which haven’t been updated since the last loaded snapshot. That filter is applied to every row before we attempt to insert it. You’d need something like a timestamp or auto-increment column in the input files/tables for this.
  • Use REPLACE or ON DUPLICATE KEY UPDATE with a direct-to-table pipeline, rather than a pipeline into a stored procedure. Passing incoming data through a stored procedure adds overhead that’s worth avoiding when possible, and it seems like in this case it would be possible.
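Putting both suggestions together, a direct-to-table pipeline might look like the sketch below. The pipeline name, file path, column names, and cutoff timestamp are all placeholders, and the exact clause ordering should be checked against the CREATE PIPELINE / LOAD DATA docs:

```sql
-- Hypothetical direct-to-table pipeline: REPLACE gives upsert semantics,
-- and the WHERE clause drops rows not modified since the last snapshot
-- before any key lookup happens in dest_table.
CREATE PIPELINE daily_load AS
  LOAD DATA FS '/data/incoming/daily_*.csv'
  REPLACE INTO TABLE dest_table
  FIELDS TERMINATED BY ','
  (id, payload, @mtime)
  SET last_modified = @mtime
  WHERE @mtime > '2021-06-01 00:00:00';
```

In practice the cutoff would be the timestamp of the previously loaded snapshot rather than a hard-coded literal.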

That would minimize the amount of work done for irrelevant rows. However, it wouldn’t save you from having to download and parse out irrelevant rows in order to figure out that they don’t match the filter. If it’s actually possible to write a “give me only recently modified rows” filter with your data, then you may be able to apply it when producing the input files in the first place, skipping the downloading and parsing as well.

If you have an auto-updated timestamp column in both the table and the input file, you can further filter against the existing table data to avoid upserting rows that haven't actually changed.
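Comparing incoming timestamps against stored ones does require going back through a stored procedure, since the filter has to join against the destination table. A rough sketch, with all names (`load_changed`, `dest_table`, columns) as placeholders:

```sql
-- Hypothetical pipeline target procedure: only upsert rows whose incoming
-- timestamp is strictly newer than what's already stored for that key.
DELIMITER //
CREATE OR REPLACE PROCEDURE load_changed(
    batch QUERY(id BIGINT, payload VARCHAR(255), mtime TIMESTAMP)) AS
BEGIN
  INSERT INTO dest_table (id, payload, last_modified)
  SELECT b.id, b.payload, b.mtime
  FROM batch b
  LEFT JOIN dest_table d ON d.id = b.id
  WHERE d.id IS NULL            -- brand-new key
     OR b.mtime > d.last_modified  -- changed since last load
  ON DUPLICATE KEY UPDATE
    payload = VALUES(payload),
    last_modified = VALUES(last_modified);
END //
DELIMITER ;
```

This trades the stored-procedure overhead mentioned above for skipping unchanged rows, so it's only a win when the join is cheaper than the avoided upserts.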