Replicating one table's data to another after some processing on each row

wfaheem · September 17, 2020, 7:19am

Here is a simple scenario. We have a table T1 which is being ingested by pipeline. This table gets 30M data per second. Table has following schema (dummy)

area_code
username
activity_duration
timestamp

Pipeline just ingest activity of each user. whenever the area_code changes, we have to calculate the activity_duration of the username by formula (current_row_timestamp minus last_row_timestamp). We are not doing this while pipeline because inserts get slow as there is a huge data coming. On each new piepline replace into, calculating the activity_duration by getting last activity (order by username, area_code desc limit 1) from 30M rows is not an option i believe.

So plan B, creating a procedure which runs after every few hours on 30+M rows and calculate the activity_duration of each username+area_code and insert into another table. But this solution is taking too much for inserts. 100k rows in 10 minutes approx. Is there any better way to do this which we are missing? Thanks in advance

P.S we are using replace into syntax and using Execute immediate sql_statement; in our procedure

hanson · September 21, 2020, 5:01pm

Are you using a columnstore or rowstore? Rowstore sounds best for this kind of application.

What’s the actual CREATE TABLE statement for your dummy example? Do you have any other indexes?

If you want the highest performance for queries in SPs, use embedded SQL in the SP, not dynamic SQL (execute immediate).

REPLACE INTO just conditionally replaces row content or inserts a row, it can’t do calculations and update fields.

Check out the INSERT ON DUPLICATE KEY UPDATE syntax. It allows calculating new field values as expressions of both old and newly updated field values. It’s probably better than REPLACE INTO for this scenario.

wfaheem · September 23, 2020, 2:21pm

@hanson Thanks for the response. I am using rowstore

area_code varchar(20),
username varchar(64),
activity_duration  int,
primary key (area_code,username)

OK, i will try with embedded SQL.
REPLACE INTO vs INSERT ON DUPLICATE KEY UPDATE -> for instance i have calculated the duration and nothing to calculate at runtime for now, simple insert or update. What is the best choice here? Because REPLACE INTO is taking too much time inserting rows, currently 260k records in 20 minutes. i have millions to insert/update

The actual process is, we have table with billions of rows a day. Then there is an analytics table which has to be filled based on certain conditions on billion row table. We, at end of the day, have to iterate billion rows and insert/update millions meanwhile. So, we need to fast insert/update bcause current iteration and replace taking more than 24H to complete (estimated). what would be the efficient way of dealing this case? batching for instance?
Any memsql reading resource or direction for our above daily problem would be great. Thanks in advance

hanson · September 23, 2020, 4:43pm

How much hardware do you have (leaves, cores per leaf, total RAM per leaf?)

Make sure you have enough partitions so all the cores keep busy. If it is a low-concurrency workload (one heavy command running at a time), consider 1 partition for each core. Otherwise have 2 or 4 cores for each partition.

What is your actual command you are using that is the bottleneck (full text of the command)?

Check out this topic on query tuning:

and this on optimizing table structures:

Also, look at the visual profile of the statement you are trying to tune in Studio (click on the little icon that looks like a tree next to the Run button on the upper right of the screen). Make sure it’s not doing any unneeded shuffle or full scan operations (or other heavy operations) that you could avoid by tweaking the command or your sharding or indexes. The table is sharded on the same key as the primary key if you don’t specify sharding explicity.

If the area_code is a number, consider changing the data type to int, rather than varchar, because that will reduce the cost of comparisons (but this won’t make an order of magnitude difference for this particular situation if your key has to include both username and area_code).

wfaheem · September 24, 2020, 7:53am

Hardware: usually not show RAM going beyond normal. CPU sometimes reach +90%

4 leaves
16 cores (each)
16 partition (each)

The table with billion rows, we fetch first 2M rows(batching via limit, offset), process them and REPLACE INTO another table. The bottleneck is, while iterating 2M records and inserting some 1M records, it take around 1h20m minutes. So you can imagine what time it will take to complete 1B iterations and atleast 50M to 60M REPLACE INTOs.

Question 1: Does this process of iterating the whole table (processing each row, if criteria met do calculations and replace into another table) is the right approach?

Question 2: Is REPLACE INTO the right choice for insert/update here

Question 3: if this approach is fine, does increasing resource is the only solution?

P.S: while iterating 2M rows, we just calculate the duration and perform:

REPLACE INTO table_name (area_code,username ,activity_duration) values(area_code,username ,activity_duration);

Yes i know primary key default for shard key. I have no experience with profile yet. I would do that too.

Update:
Just tested, if i comment REPLACE INTO command: whole process takes just 60 minutes. Which fetch 55M rows → iterate over them, do calculation part etc.
REPLACE INTO is the bottleneck here i believe. We need fast insert/update.