MAX_BY function

Hi,

It would be great if SingleStore had the MAX_BY/MIN_BY aggregation functions, as it’s a very common function used in OLAP.

Any suggestions around the best performant workaround to achieve what MAX_BY/MIN_BY do with the current functions supported by SingleStore, in a Columnstore table?

Thanks!

Hi, We currently have the FIRST and LAST builtin aggregate functions: https://docs.singlestore.com/v7.1/reference/sql-reference/time-series-functions/first/. Would this work for your use case?

@Keneck I think FIRST and LAST can do what you need as long as your second (ordering) argument is a datetime or timestamp. Will that work for you?

See https://docs.singlestore.com/v7.1/reference/sql-reference/time-series-functions/first/

Can you share an example of MAX_BY and MIN_BY used in one of your queries?

HI @Keneck, do you could answer my question just above?

@rob @hanson Yes, whenever the ordering argument is a datetime or a timestamp, the FIRST and LAST functions work great. However, if you have a table like the one below, that stores temperature readings of multiple devices, what is the most efficient query that could be used to obtain, for each device, the dt when the maximum and minimum temperatures were recorded?

CREATE TABLE `temperatures` (
  `device_id` int,
  `dt` datetime,
  `temperature` float
);

Thanks!

Try something like

CREATE TABLE `temperatures` (
  `device_id` int,
  `dt` datetime series timestamp,
  `temperature` float
);

select device_id, first(temperature), last(temperature), 
  min(temperature), max(temperature), 
  min(dt) as first_dt, max(dt) as last_dt
from temperatures
group by device_id;

Using min and max on dt can give you the first and last dt values for the device_id you are grouping by.

Oh, I see you are asking a different question. You could do it with a user-defined aggregate that keeps around the time when the min and max temperatures were recorded and includes that in the output.

Alternatively, you could find your min and max temps for each device in a CTE, and then join on temperature to get the corresponding dt value.

And I see this description of MAX_BY on the web.

I opened an internal feature request for MIN_BY and MAX_BY to track this.

Here’s how you could do it with a user-defined aggregate.

DELIMITER //
CREATE FUNCTION max_by_init() RETURNS 
    RECORD(dt DATETIME(6), val DOUBLE) AS
  BEGIN
    RETURN ROW(NULL, NULL);
  END //
DELIMITER ;
DELIMITER //
CREATE FUNCTION max_by_iter(
    state RECORD(dt DATETIME(6), val DOUBLE),
    dt datetime(6),
    val DOUBLE) RETURNS RECORD(dt DATETIME(6), val DOUBLE)
  AS
  BEGIN
    RETURN ROW(
       if(val > state.val, dt, state.dt),
       if(val > state.val, val, state.val));
  END //
DELIMITER ;
DELIMITER //
CREATE FUNCTION max_by_merge(state1 RECORD(dt DATETIME(6), val DOUBLE), 
                             state2 RECORD(dt DATETIME(6), val DOUBLE)) 
    RETURNS RECORD(dt DATETIME(6), val DOUBLE) AS
  BEGIN
    RETURN row(if(state1.val > state2.val, state1.dt, state2.dt),
               if(state1.val > state2.val, state1.val, state2.val));
  END //
DELIMITER ;
DELIMITER //
CREATE FUNCTION max_by_terminate(state RECORD(dt DATETIME(6), val DOUBLE)) 
    RETURNS DATETIME(6) AS
  BEGIN
    RETURN state.dt;
  END //
DELIMITER ;

CREATE AGGREGATE max_by(DATETIME(6), DOUBLE) RETURNS DATETIME(6)
  WITH STATE RECORD(dt DATETIME(6), val DOUBLE)
  INITIALIZE WITH max_by_init
  ITERATE WITH max_by_iter
  MERGE WITH max_by_merge
  TERMINATE WITH max_by_terminate;

-- test
create table ts(d datetime(6), v double);

insert ts values(now(6), 1);
insert ts values(now(6), 2);
insert ts values(now(6), 3);
insert ts values(now(6), 2);
insert ts values(now(6), 1);

select * from ts order by d;

+----------------------------+------+
| d                          | v    |
+----------------------------+------+
| 2021-01-04 16:39:52.411670 |    1 |
| 2021-01-04 16:39:57.351114 |    2 |
| 2021-01-04 16:40:04.794208 |    3 |
| 2021-01-04 16:40:13.367547 |    2 |
| 2021-01-04 16:40:19.969650 |    1 |
+----------------------------+------+

select max(v), max_by(d, v)
from ts;

+--------+----------------------------+
| max(v) | max_by(d, v)               |
+--------+----------------------------+
|      3 | 2021-01-04 16:40:04.794208 |
+--------+----------------------------+