HLL + Group By - Memory Usage

Hi -

I have a use case where I am running a GROUP BY query over a large number of HLLs that returns anywhere from 1M to 10M rows (each row containing a combined HLL – the raw 16KB value, not the cardinality). Since each HLL is stored in a dense 16KB layout, this consumes a significant amount of memory during the query: a query that generates 3M rows requires about 50GB of memory to complete (3M rows × 16KB per HLL ≈ 48GB).

A sparse HLL layout would resolve this problem for us (our HLLs often have low cardinality, so the query would consume < 100MB), but as far as I can tell it's not yet supported in MemSQL.

Are there any options here to either reduce memory usage or have the query use disk for processing the HashGroupBy operation? I could theoretically split the query into multiple queries, but I was hoping to avoid that.

Effectively, the query looks like this:

select approx_count_distinct_combine(approx_value)
from facts where
account_id = ... and metric_id = ... and approx_value is not null
group by ...;

The output of the profile is:

| Gather partitions:all alias:remote_0 actual_rows: 3,092,050 exec_time: 0ms start_time: 00:00:00.001 end_time: 00:01:13.058 |
| Project [`approx_count_distinct_combine(approx_value)`] actual_rows: 3,092,050 exec_time: 147ms start_time: [00:00:00.981, 00:00:01.780] network_traffic: 50,669,424.000000 KB network_time: 71,506ms |
| HashGroupBy [APPROX_COUNT_DISTINCT_COMBINE(facts.approx_value) AS `approx_count_distinct_combine(approx_value)`] groups:[...] actual_rows: 3,092,050 exec_time: 415ms start_time: [00:00:00.375, 00:00:00.943] memory_usage: 434,111.218750 KB |
| Filter [facts.account_id = ? AND facts.metric_id = ? AND facts.approx_value IS NOT NULL] actual_rows: 3,092,050 exec_time: 1ms start_time: [00:00:00.361, 00:00:00.443] |
| ColumnStoreScan report_service.facts, KEY facts_sort_key (...) USING CLUSTERED COLUMNSTORE WITH(COLUMNSTORE_SEGMENT_ROWS=102400) actual_rows: 15,604,684 exec_time: 782ms start_time: [00:00:00.353, 00:00:00.425] memory_usage: 125,829.117188 KB segments_scanned: 1,221 segments_skipped: 9,164 segments_fully_contained: 0 |

I’ve seen this addressed in different databases by either forcing the query engine to use disk for processing the HashGroupBy or by using sparse HLL layouts. Would love to hear your thoughts.

Thanks!

Your query has a GROUP BY clause but no grouping columns listed in the SELECT list, only the aggregate. Can you expand it some to make it closer to the actual query?

My first thought is to partition the work, say by subsets or ranges of the grouping keys (which I guess you have already considered), e.g. hash the grouping key into 10 buckets and run 10 queries, one per bucket.
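
For example, a rough sketch of one of those 10 queries, using CRC32 as an example hash and app_id/epoch as stand-ins for whatever the real grouping columns are:

-- run once per bucket, replacing 0 with 1..9 on the other runs
select approx_count_distinct_combine(approx_value)
from facts
where account_id = ... and metric_id = ... and approx_value is not null
  and crc32(concat(app_id, ':', epoch)) % 10 = 0
group by ...;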

Alternatively, use more hardware.

Hey @hanson!

Thanks for the follow-up. Yeah, I accidentally left the grouping columns out of the SELECT when I was writing up the comment. The actual query looks something like this:

insert into facts (
  account_id,
  app_id,
  epoch,
  metric_id,
  network_id,
  country_id,
  placement_id,
  ad_unit_id,
  zone_id,
  sdk_version_id,
  app_version_id,
  approx_value
)
select
  account_id,
  app_id,
  epoch,
  metric_id,
  network_id,
  country_id,
  placement_id,
  ad_unit_id,
  zone_id,
  sdk_version_id,
  app_version_id,
  APPROX_COUNT_DISTINCT_COMBINE(approx_value)
from facts_staging
where approx_value is not null
group by account_id, app_id, epoch, metric_id, network_id, country_id, placement_id, ad_unit_id, zone_id, sdk_version_id, app_version_id;

It’s really just the same one that I posted – I just forgot to add ... to the select portion of the query.

Your suggestion on partitioning the work is exactly what we’ve moved forward with. Using more hardware becomes a bit untenable – we’d 5-10x the cost of our cluster for a single query :slight_smile:
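
For sizing how many buckets to use, one option is a rough pre-check that counts the distinct grouping keys in the staging table and multiplies by the 16KB dense HLL size, along these lines:

-- rough estimate: one dense 16KB HLL per output group in the combine step
select count(*) * 16 / 1024 / 1024 as estimated_gb
from (
  select distinct
    account_id, app_id, epoch, metric_id, network_id, country_id,
    placement_id, ad_unit_id, zone_id, sdk_version_id, app_version_id
  from facts_staging
  where approx_value is not null
) g;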

Realistically, it seems like the long-term solution is to look into support for sparse layout representations of HLLs (or, obviously, to change the application so it doesn't need to do this type of aggregation). That would make a huge impact on memory usage. I attempted to implement sparse layouts with UDFs, but the performance was just too poor – we'd really need C extension support to attempt that.

Got it. Thanks for the feedback. I passed it on to our dev team.