What's the most efficient way to delete millions of rows in MemSQL?

I’ve always used smaller transactions when deleting (and writing) data in SQL databases.

Is it also best practice with MemSQL? Any differences between Rowstore and Columnstore tables?

What’s the most efficient way to delete e.g. 1,000,000 rows from a Columnstore table?

  1. A single DELETE statement?
  2. Multiple DELETE statements with a LIMIT clause of e.g. 10,000 rows?
  3. A single DELETE statement with autocommit disabled?
  4. Other?

If batched deletes are the recommended way, what’s the best way to determine the LIMIT? Evaluate it based on CPU usage?

Deletes are pretty efficient. Are you experiencing any issues? And if so, what are they?

@nikita Not at all. I was just curious if MemSQL had any different recommendations. :slight_smile:

Would you just delete ~900 million rows in a single statement?

My first thought was to use multiple DELETE statements with a LIMIT of 10,000 rows each, with autocommit enabled (the default).

Smaller transactions were always the advice when I started working with SQL databases. Probably because of smaller “chunks” in the transaction logs?
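Something like this is what I had in mind, run repeatedly until it affects 0 rows (the table name and date filter are just placeholders):

DELETE FROM events
WHERE created_at < '2019-01-01'
LIMIT 10000;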

I would try to delete everything, and if it takes too long, use LIMIT.


1 million rows at a time was apparently too much. The two 4 vCPU leaves increased to 95-100% CPU usage and caused all queries to run slowly.

It took 4+ hours after stopping the DELETE statements before the CPU returned to normal. Why is this? Some background merger activity?

Actually, after watching the CPU usage from yesterday’s 10,000-row-limit deletes, the situation seemed to be the same: CPU usage increased to around 80-100%. No users were using the system at the time, so I’m not sure how it affected queries.

Which version of MemSQL? What values do you have for:

select @@columnstore_disk_insert_threshold;
select @@columnstore_segment_rows;

Cheers

Version: 7.0.10

columnstore_disk_insert_threshold: 0.500000
columnstore_segment_rows: 1024000

Would it be useful if I send you a cluster report?

Disk I/O: 200-250 MiB/s (write) in the high CPU period.

A technique I have used in the past, when the amount of data to delete is greater than the data I want to keep, is to insert the data I want to keep into a new table and swap.
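Roughly along these lines (the table and column names are placeholders, and double-check the rename syntax for your MemSQL version):

CREATE TABLE orders_new LIKE orders;
INSERT INTO orders_new SELECT * FROM orders WHERE keep_flag = 1;
ALTER TABLE orders RENAME TO orders_old;
ALTER TABLE orders_new RENAME TO orders;
DROP TABLE orders_old;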

Thanks for the suggestion, but in this case I only want to delete ~200 million rows from a table with ~3.5 billion rows.

Sometimes I’ll use a status flag to remove rows from consideration in transactions, then just schedule a job to periodically delete 50-100k at a time throughout the day. Minimizes impact.
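For example (the is_deleted flag and table name are just placeholders):

-- transactions and queries only consider live rows
UPDATE orders SET is_deleted = 1 WHERE order_id = 42;

-- scheduled cleanup job, run periodically throughout the day
DELETE FROM orders WHERE is_deleted = 1 LIMIT 50000;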

For more time-sensitive deletes, I wrote a simple looping delete procedure where you pass in the table name and the LIMIT, and it just loops until the table is empty. I suppose you could add an additional parameter to drive a WHERE clause. It is dynamic SQL. I would really like the ability to add a “SLEEP” call between the loops to let the background processes catch up.

I don’t write the prettiest code, so be kind.

call LoopingDelete('schema.table_name', 30000);

DELIMITER //
CREATE PROCEDURE LoopingDelete(p_tablename varchar(100), pn_delete_limit bigint(21)) RETURNS void AS
DECLARE
  # batch size as text so it can be concatenated into the dynamic SQL
  v_delete_limit text = cast(pn_delete_limit as char);
  q_delete_rows text;
  q_record_counts text;
  v_cnt bigint(21);
BEGIN
  ##### Define dynamic queries #####
  q_delete_rows = 'DELETE FROM ' || p_tablename || ' LIMIT ' || v_delete_limit;
  q_record_counts = 'SELECT COUNT(1) cnt FROM ' || p_tablename;

  ##### Define table count #####
  v_cnt = scalar(q_record_counts, QUERY(cnt bigint(21)));

  ##### Delete in batches until the table is empty #####
  WHILE v_cnt > 0 LOOP
    EXECUTE IMMEDIATE q_delete_rows;
    COMMIT;
    v_cnt = scalar(q_record_counts, QUERY(cnt bigint(21)));
  END LOOP;
END //
DELIMITER ;


TRUNCATE if you need to delete everything, or the looping method with a LIMIT seems to be a sufficient fallback plan.
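For the delete-everything case (table name is a placeholder):

TRUNCATE TABLE schema.table_name;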