TPC-CH Query 13 and Query 20 don't work for MemSQL

nhas9102 · June 6, 2020, 3:07pm

Hi,

I’m trying to benchmark my MemSQL setup using TPC-CH queries found here.

I’m unable to run Queries 13 and 20.

13

select	 c_count, count(*) as custdist
from	 (select c_id, count(o_id)
	 from customer left outer join orders on (
		c_w_id = o_w_id
		and c_d_id = o_d_id
		and c_id = o_c_id
		and o_carrier_id > 8)
	 group by c_id) as c_orders (c_id, c_count)
group by c_count
order by custdist desc, c_count desc

20

select	 su_name, su_address
from	 supplier, nation
where	 su_suppkey in
		(select  mod(s_i_id * s_w_id, 10000)
		from     stock, orderline
		where    s_i_id in
				(select i_id
				 from item
				 where i_data like 'co%')
			 and ol_i_id=s_i_id
			 and ol_delivery_d > '2010-05-23 12:00:00'
		group by s_i_id, s_w_id, s_quantity
		having   2*s_quantity > sum(ol_quantity))
	 and su_nationkey = n_nationkey
	 and n_name = 'Germany'
order by su_name

Any help in being able to execute these queries would be appreciated. The log for each are as follows -

13

ERROR 1064 (42000) at line 1: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘(c_id, c_count)
group by c_count
order by custdist desc, c_count desc’ at line 8

20

ERROR 1713 (HY000) at line 1: The query cannot be executed. MemSQL does not support this type of query: unsupported nested scalar subse
lects

I’m using version - 7.1.2

jack · June 6, 2020, 3:55pm

Haven’t had a chance to look into q20 yet, but for q13, you just need to use a different syntax to specify a column alias c_count:

select	 c_count, count(*) as custdist
from	 (select c_id, count(o_id) as c_count
	 from customer left outer join orders on (
		c_w_id = o_w_id
		and c_d_id = o_d_id
		and c_id = o_c_id
		and o_carrier_id > 8)
	 group by c_id) as c_orders
group by c_count
order by custdist desc, c_count desc

nhas9102 · June 6, 2020, 4:31pm

Cheers! Thanks for the update jack

nhas9102 · June 7, 2020, 10:24am

Hi Jack,

Any update on query 20? Thanks!
Nahian

evan · June 8, 2020, 5:10pm

I’m able to repro your failure for query 20 with the below.

Looks like the where s_supplier in gets pushed down as filter into the subselect. The error message roughly means our distributed planner isn’t sending that subselect to the right leaf nodes - the correlated subselect will need a s_supplier value from another leaf node.

I think there is likely some rewrite or table schema tweak to make this work, though I am not sure offhand what that might be.

drop all from plancache;
drop database if exists db;
create database db;
use db;

CREATE TABLE supplier (
	S_SUPPKEY		SERIAL PRIMARY KEY,
	S_NAME			CHAR(25),
	S_ADDRESS		VARCHAR(40),
	S_NATIONKEY		BIGINT NOT NULL, -- references N_NATIONKEY
	S_PHONE			CHAR(15),
	S_ACCTBAL		DECIMAL,
	S_COMMENT		VARCHAR(101)
);

CREATE TABLE nation (
	N_NATIONKEY		SERIAL PRIMARY KEY,
	N_NAME			CHAR(25),
	N_REGIONKEY		BIGINT NOT NULL,  -- references R_REGIONKEY
	N_COMMENT		VARCHAR(152)
);

create table stock (
    s_i_id int not null, 
    s_w_id int not null, 
    s_quantity smallint, 
    s_dist_01 char(24), 
    s_dist_02 char(24),
    s_dist_03 char(24),
    s_dist_04 char(24), 
    s_dist_05 char(24), 
    s_dist_06 char(24), 
    s_dist_07 char(24), 
    s_dist_08 char(24), 
    s_dist_09 char(24), 
    s_dist_10 char(24), 
    s_ytd decimal(8,0), 
    s_order_cnt smallint, 
    s_remote_cnt smallint,
    s_data varchar(50),
    PRIMARY KEY(s_w_id, s_i_id),
    shard key(s_w_id)
);


-- page 16 of http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5.11.0.pdf

create table orderline(
    ol_o_id int,
    ol_d_id int,
    ol_w_id int,
    ol_number int,
    ol_i_id int,
    ol_supply_w_id int,
    ol_delivery_d datetime default null,
    ol_quantity numeric(2),
    ol_amount numeric(6, 2),
    ol_dist_info char(24),
    primary key(ol_w_id, ol_d_id, ol_o_id, ol_number)
);


create table item (
    i_id int not null, 
    i_im_id int, 
    i_name varchar(24), 
    i_price decimal(5,2), 
    i_data varchar(50),
    PRIMARY KEY(i_id)
);

select	 s_name, s_address
from	 supplier, nation
where	 s_suppkey in
		(select  mod(s_i_id * s_w_id, 10000)
		from     stock, orderline
		where    s_i_id in
				(select i_id
				 from item
				 where i_data like 'co%')
			 and ol_i_id=s_i_id
			 and ol_delivery_d > '2010-05-23 12:00:00'
		group by s_i_id, s_w_id, s_quantity
		having   2*s_quantity > sum(ol_quantity))
	 and s_nationkey = n_nationkey
	 and n_name = 'Germany'
order by s_name;

-- Found correlated subselect that cannot be rewritten:
-- (`supplier`.`s_suppkey` in 
--     ( 
--     select distinct ((`stock`.`s_i_id`*`stock`.`s_w_id`)%?) as `mod(s_i_id * s_w_id, 10000)`
--     from            ((`?`.`stock`   as `stock` 
--     join            `?`.`orderline` as `orderline` ) 
--     join            `?`.`item` as `item` ) 
--     where           ((`stock`.`s_i_id` = `item`.`i_id`)
--                     and (`supplier`.`s_suppkey` = ((`stock`.`s_i_id`*`stock`.`s_w_id`)%?))
--                     and (`orderline`.`ol_i_id` = `stock`.`s_i_id`)
--                     and (`orderline`.`ol_delivery_d` > (?!:>datetime(6) null))
--                     and (`item`.`i_data` like ?))
--     group by        `stock`.`s_i_id`, 
--                     `stock`.`s_w_id`, 
--                     `stock`.`s_quantity` 
--     having          ((?*`stock`.`s_quantity`) > sum(`orderline`.`ol_quantity`)) )
-- )

evan · June 8, 2020, 8:38pm

i believe this is the expression that prevents us from running the query

jack · July 7, 2020, 8:35pm

This is a similar but slightly different query as in your other thread, and it is affected by the same issue - copying what I wrote there:

We investigated this and found that a new optimization added in 7.1 accidentally prevented us from being able to run this query shape. The issue was for certain shapes of IN subselects. The query runs fine on 7.0 and earlier versions, and we have made a fix to enable running the query again which will be available in future versions of MemSQL.

The expression noted above is also part of the reason memsql wasn’t able to execute this query, and another potential workaround would have been to create a computed column storing that expression into a field.

bharti.anup · July 8, 2020, 12:43pm

Hello Evan,

Even i am also doing TPC-H benchmark test for MemSQL database.

On the same TPC-H queries i have question with respect to loading the data.
I followed following link for creation of schema ,generation of data and and running queries .

Here is the link

I was successfully able to generate the data but while loading the data i am getting an error for load data local infile to the lineitem table.
please find the error

ERROR 1261 (01000): Leaf Error (X.X.X.X:3306): Row 192 doesn’t contain data for all columns

I have exactly followed the procedure given in the above link. so when i identify the data and column there is an extra | at the end of the .tbl file generated by dsgen.

Any idea how to solve this ? i was able to identify the problem and solved it for scale factor of 1 GB but doing it for large scale is a problem

Please suggest .

jack · July 21, 2020, 5:20pm

That is a separate question and I would advise opening a new thread for it so people can see it better.

Note that this thread is about TPC-CH which is different than TPC-H and not an official TPC benchmark.