Support for truncating data by sort key

Consider a table like:

CREATE TABLE `log` (
  `id` bigint(20) unsigned NOT NULL,
  `timestamp` datetime NOT NULL SERIES TIMESTAMP,
  `daytimestamp` as date_trunc('day', timestamp) PERSISTED datetime,
  `some_id` bigint(20) unsigned NOT NULL,
  `some_other_id` bigint(20) unsigned DEFAULT null,
  `text` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
  `customer_id` int(10) unsigned NOT NULL,
  SHARD KEY (`some_id`),
  KEY (`customer_id`,`daytimestamp`) USING CLUSTERED COLUMNSTORE,
  UNIQUE KEY (`customer_id`,`id`, `some_id`) USING HASH
);

Each customer stores millions (sometimes hundreds of millions) of logs. Every now and then a customer leaves and we need to delete their data.

Currently we’re doing the delete in batches of 50k to try to avoid too much lock contention on the table, but it’s painfully slow:

delete from log where customer_id = @customer_id limit 50000
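For context, the batch statement above has to be driven in a client-side loop; a minimal sketch (the loop itself is client pseudocode, and ROW_COUNT() is the standard MySQL-protocol function reporting rows affected by the last statement):

```sql
-- One batch; the client repeats this statement until ROW_COUNT() returns 0,
-- i.e. until no rows for the customer remain.
DELETE FROM log
WHERE customer_id = @customer_id
LIMIT 50000;

-- client pseudocode:
--   repeat the DELETE while ROW_COUNT() > 0
```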

This led us to wonder why it’s not possible to truncate by sort key. By defining the sort key, we tell SingleStore that the data should always be sorted by customer_id first, so we assume the data is physically stored in roughly that order. Being able to say something like:

truncate table log where customer_id = @customer_id

would be really beneficial. The statement should of course require that the predicate matches a prefix of the sort key.

We’re also using SingleStore Cloud, so a lot of the data lives in cold storage, where it would be far more beneficial to simply drop the objects in cold storage, rather than first pulling them into hot storage, flagging the rows for deletion, and then persisting those changes back to cold storage. I’m guessing that’s roughly how it works today, but I might be off :sweat_smile:

If this is already doable, then I apologize. Any pointers on how best to do bulk deletes without causing too much lock contention would also be greatly appreciated.

Sorry it took so long to get back to you. This is an interesting request. I know that we can delete data very fast. How many rows do you have per customer when you delete one? What other kinds of concurrent access are going on at the same time as the deletes?

This blog gives an example of a fast delete (116 million rows in 0.2 sec).

It might be an interesting exercise to look at a PROFILE plan for the DELETE with and without the LIMIT on it. What happens when you don’t have the LIMIT? Another way to batch it might be by (customer, date range).
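The (customer, date range) batching suggestion can be sketched as follows. Deleting one day of one customer’s logs per statement means the predicate lines up with the (customer_id, daytimestamp) columnstore sort key, so each batch should touch a contiguous run of segments rather than scattering across the table. The dates here are placeholders:

```sql
-- Hedged sketch: delete one customer's logs one day at a time, stepping
-- the date window forward in the client. The half-open range on
-- daytimestamp matches the table's sort key prefix.
DELETE FROM log
WHERE customer_id = @customer_id
  AND daytimestamp >= '2024-01-01'   -- illustrative window start
  AND daytimestamp <  '2024-01-02';  -- illustrative window end
```

The client would advance the window day by day (or week by week, depending on row counts per day) until the customer’s full date range is covered.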