What are the best practices for minimizing memory usage for GROUP BYs on columnstore tables – is there any way to do so? Does the sort key make a difference?
This is a pretty general question. The best approach is to have fewer groups, or fewer total columns across the GROUP BY and aggregate lists. Grouping by smaller data types, like INT instead of BIGINT, also helps.
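For example (the table and column names below are made up for illustration), dropping an unneeded grouping column and storing keys in narrower types can both shrink the hash table the group-by has to build:

```sql
-- Hypothetical table: events(user_id BIGINT, region_code BIGINT, ts DATETIME, amount DOUBLE)

-- Wider: two BIGINT keys plus an extra grouped column
SELECT user_id, region_code, ts, SUM(amount)
FROM events
GROUP BY user_id, region_code, ts;

-- Narrower: fewer key columns, and smaller types where the data allows
-- (e.g. declare region_code as INT or SMALLINT instead of BIGINT)
SELECT user_id, region_code, SUM(amount)
FROM events
GROUP BY user_id, region_code;
```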
Also, we now support spilling hash group-by state to disk during query execution.
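If your build exposes it, spilling is typically controlled through engine variables. The variable name below is an assumption on my part; verify it against your version's documentation before relying on it:

```sql
-- List spilling-related variables on your cluster
SHOW VARIABLES LIKE '%spilling%';

-- Assumed variable name; confirm it exists on your version first
SET GLOBAL enable_spilling = ON;
```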
Merge join can take less space, but it can be slower because it can't operate directly on encoded data.
Why do you ask?
Ah, so it’s not super straightforward. I have noticed that the fewer columns in the group by, the better. Glad that SingleStore doesn’t require every non-aggregated value to appear in the group by.
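A quick sketch of that relaxed GROUP BY behavior (table and columns here are hypothetical): engines that enforce ONLY_FULL_GROUP_BY-style rules would reject this query, whereas SingleStore accepts it and returns a value for the ungrouped column:

```sql
-- 'city' is neither aggregated nor listed in the GROUP BY
SELECT customer_id, city, COUNT(*)
FROM orders
GROUP BY customer_id;
```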
I’m asking because one of my org’s apps runs such an aggregation approximately every hour, and I’m trying to identify ways to minimize resource usage… I can modify the underlying structure of this table however I see fit and tailor it to this use case, so I was trying to see what changes might make sense. The columnstore table in question has ~22M rows, which results in ~15M records in this aggregation, and it takes 14.5 GB of memory to perform the Hash Group By as-is.
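For what it's worth, to see per-operator memory use while experimenting with table changes, I've been profiling the query; this assumes the standard PROFILE / SHOW PROFILE workflow (the table and column names are placeholders):

```sql
-- Run the query under profiling
PROFILE SELECT col_a, col_b, SUM(metric)
FROM my_columnstore_table
GROUP BY col_a, col_b;

-- Per-operator stats, including memory used by the HashGroupBy operator
SHOW PROFILE;
```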
We’re on 7.5.8 - is the spilling option available in this version, or was that introduced later?