Optimizing query with where and order by desc

miguilim · September 6, 2022, 3:43am

I’m currently performing a query in a table of 10M rows. Even with the sort key and indexes, the following query still scans half of the table to find the result:

select `created_at` from `completed_tasks` where (`user_id` = 1 or `ip_address` = '127.0.0.1') order by `created_at` desc limit 1;

This is the table schema:

CREATE TABLE `completed_tasks` (
  `id` char(36) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  `user_id` bigint(20) unsigned NOT NULL,
  `ip_address` varchar(45) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  `task_id` bigint(20) unsigned NOT NULL,
  `created_at` timestamp NULL DEFAULT NULL,
  KEY `completed_tasks_user_id_ip_address_index` (`user_id`,`ip_address`) USING HASH,
  KEY `created_at` (`created_at`) USING CLUSTERED COLUMNSTORE,
  UNIQUE KEY `PRIMARY` (`id`) USING HASH,
  KEY `completed_tasks_user_id_index` (`user_id`) USING HASH,
  SHARD KEY `__SHARDKEY` (`id`)
) AUTOSTATS_CARDINALITY_MODE=INCREMENTAL AUTOSTATS_HISTOGRAM_MODE=CREATE AUTOSTATS_SAMPLING=ON SQL_MODE='ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES,NO_AUTO_CREATE_USER'

Profile screenshot:

MariaSilverhardt · September 6, 2022, 6:34pm

Hi Miguilim!

So sorry to hear that you are having this issue. We’re happy to help.
Can you share whether you are running on self-hosted or managed service and what version?

miguilim · September 6, 2022, 10:30pm

Hello Maria,

I’m currently running a self-hosted version 7.8.13.

rimbert · September 7, 2022, 4:22pm

Hi Miguilim !
The engine will always need to scan a certain number of rows because it rely on the segment size of the columnstore table. You can reduce segment size but it’s not a good recommendation to do it because your data will have less compression rate. Otherwise here is some documentation about how you can reduce segment size to scan less rows : Configuring the Columnstore to Work Effectively · SingleStore Documentation

Regarding the query itself, I would recommend to create an hash index on ip_address too.
Columnstore segments are usually skip regarding the shard key , you may consider changing the shard key if you really want to achieve high performance on this specific query.
Also if you click on the operation ColumnstoreScan you will see how many segments have been skipped. You can also use EXPLAIN at the beginning of your query to see how and which indexes have been used to optimise your query.

What is the speed of the current query and what speed do you want to achieve for this particular query ?

Hope it helps !

miguilim · September 7, 2022, 6:35pm

Hi rimbert,

Thanks for the support! I’ve changed the primary key to id,user_id, shard key to user_id and created an index to ip_address. I also changed the sort key created_at to DESC.

Now it’s working way better, but it is still not ideal. This is a very frequent query that should be optimized in the best possible way.

I tried changing the columnstore_segment_rows variable to a lower value (now it’s set to 102400), but it didn’t help at all. The query just increased the number of segments scanned.

Visual Explain with all those changes:

arnaud · September 8, 2022, 4:00am

We could help you further if you send us a csv file (with anonymized data) and your latest schema for that table. Feel free to shoot me an email at managed@singlestore.com . We will reply with help on this forum to make sure everyone can leverage that solution.

miguilim · September 8, 2022, 5:40am

Hello arnaud,

Thanks for such help on this thread!

Sorry but I didn’t understand what you mean by the CSV file, you mean a table dump? If so, it’s about 10M rows. What is the best to send you this through email?

About the schema, here is the latest version (the latest Visual Explain is using this schema):

CREATE TABLE `completed_tasks` (
  `id` char(36) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  `user_id` bigint(20) unsigned NOT NULL,
  `ip_address` varchar(45) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  `operation_id` bigint(20) unsigned NOT NULL,
  `created_at` timestamp NULL DEFAULT NULL,
  UNIQUE KEY `PRIMARY` (`id`,`user_id`) USING HASH,
  KEY `created_at` (`created_at` DESC) USING CLUSTERED COLUMNSTORE,
  KEY `completed_faucet_claims_ip_address_index` (`ip_address`) USING HASH,
  SHARD KEY `__SHARDKEY` (`user_id`)
) AUTOSTATS_CARDINALITY_MODE=INCREMENTAL AUTOSTATS_HISTOGRAM_MODE=CREATE AUTOSTATS_SAMPLING=ON SQL_MODE='ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES,NO_AUTO_CREATE_USER';

arnaud · September 8, 2022, 5:54am

If you have a Google Drive / One Drive Link that you can share by email, it will be great. 10m rows will probably not fit in one email.

And CSV was an example but any format that you used to ingest the data will work.

smargareto · September 8, 2022, 5:59am

Hi,
Maybe try changing the sort key to (user_id, ip_address) so the data is already sorted by user_id and the filter should be really fast.
BR

miguilim · September 8, 2022, 6:40am

Hi arnaud, just sent you the email with the table dump!

ganesh.usi.gk · September 8, 2022, 1:13pm

Hi @miguilim

Your query migth be like this.
Please check whether it is satisfied your expected output.

select MAX( created_at) from completed_tasks where (user_id = 1 or ip_address = ‘127.0.0.1’)

Note: And also check the below one

select count(1) from completed_tasks where (user_id = 1 or ip_address = ‘127.0.0.1’) .

If this query returns more than 5M rows ,then definitely it will scan half of the table which is expected.

miguilim · September 9, 2022, 6:29pm

Hello @smargareto, thanks for the suggestion! This approach doesn’t make much sense to me. I’ve tried with this sort key and setting the shard key back to ID, and now the query is scanning the entire table.

Hello @ganesh.usi.gk , I hadn’t thought about this kind of approach. Really interesting! But I’ve tried with it, and it’s scanning the same amount and working exactly as the default one.

About the count one, the query is returning only 7. This was already expected, it’s only a small portion of data per user!

smargareto · September 12, 2022, 5:34am

My bad, I thought it was an AND between conditions, not an OR…