Getting a smaller TPC-H dataset using pipelines

jkricej · June 29, 2022, 3:43pm

Is there a chance that the TPC-H dataset sf_10 is available in the same S3 bucket?

```
-- replacing sf_100 with sf_10
CREATE OR REPLACE PIPELINE tpch_100_orders
AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/orders/'
CONFIG '{"region":"us-east-1"}'
SKIP DUPLICATE KEY ERRORS
INTO TABLE orders
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '|\n';
```

For instance, sf_10 instead of sf_100. I am interested in pulling a smaller dataset from the S3 bucket. I am using the TPC-H dataset that I got from the official site, and sf_10 is a smaller option there. My question is whether these pipeline datasets are also available in smaller versions.

MariaSilverhardt · June 29, 2022, 11:41pm

Welcome, jkricej! Are you on the managed or self-hosted service, and what version are you running?

jkricej · July 1, 2022, 7:18am

Self-hosted. I am using Cluster in a Box. For my bachelor's degree I am comparing the TPC-H dataset and its queries between MariaDB (SQL) and SingleStore (NewSQL) to show that SingleStore is the better and newer alternative. Because I have problems creating a 100 GB dataset for MariaDB, I thought I would lower the size to 10 GB, which seems more doable. Therefore, I would need an sf_10 dataset of TPC-H in the S3 bucket. Does it exist? Thank you so much for answering.

MariaSilverhardt · July 7, 2022, 8:54pm

Hi Jkricej!

It is possible to replace sf_100 with sf_10 to get the dataset at scale factor 10 (roughly 10 GB). We also have sf_1000 in addition to sf_10 and sf_100.

For example: replace memsql-tpch-dataset/sf_100/orders/ with memsql-tpch-dataset/sf_10/orders/
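
A minimal sketch of what the adjusted pipeline might look like, assuming the sf_10 prefix is laid out the same way as sf_100. The pipeline name `tpch_10_orders` is only an illustrative choice, and the `orders` table is assumed to exist already. Note the straight quotes: curly quotes introduced by forum formatting will not parse as SQL.

```sql
-- Same definition as the sf_100 pipeline, with only the S3 prefix changed.
-- tpch_10_orders is a hypothetical name chosen for this example.
CREATE OR REPLACE PIPELINE tpch_10_orders
AS LOAD DATA S3 'memsql-tpch-dataset/sf_10/orders/'
CONFIG '{"region":"us-east-1"}'
SKIP DUPLICATE KEY ERRORS
INTO TABLE orders
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '|\n';

-- A pipeline does not load anything until it is started.
START PIPELINE tpch_10_orders;
```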

Please confirm this resolves your issue. Thank you! Wishing you the best on your class project.

jkricej · July 8, 2022, 6:52am

Sadly, I already tried this, because I saw the different dataset sizes when generating data for MariaDB. I have been running the pipelines with sf_10 instead of sf_100 for a day and a half, and they don't seem to load any data (I recreated all the pipelines as you suggested). I will provide a screenshot of the running pipelines and the tpch database. When I started the sf_100 pipeline, I got batch reports and the tables started to fill up. But that is not the case when I change the dataset to sf_10.

I would kindly ask for any workaround. I really don't know why the 10 GB dataset is not working. If you need any contact information, please let me know. I am willing to work with your support. Thank you so much for taking care of this and answering so fast.
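
For anyone hitting a similar stall, a few checks that might help narrow it down. This is only a sketch: it assumes the pipeline is named `tpch_10_orders` (a hypothetical name) and it does not explain why the sf_10 load produced no data in this case.

```sql
-- Is the pipeline actually in the Running state?
SHOW PIPELINES;

-- Did the pipeline discover any files under the sf_10 prefix,
-- and what state are they in?
SELECT FILE_STATE, COUNT(*) AS files
FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 'tpch_10_orders'
GROUP BY FILE_STATE;

-- Were any errors recorded for this pipeline?
SELECT *
FROM information_schema.PIPELINES_ERRORS
WHERE PIPELINE_NAME = 'tpch_10_orders';

-- Pull one row through the pipeline without writing it to the table.
TEST PIPELINE tpch_10_orders LIMIT 1;
```

If the file query returns no rows at all, that would suggest the pipeline found nothing at the configured path, which would point to the path or its contents rather than to the table definition.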

jkricej · July 14, 2022, 6:57am

The issue is sadly still not resolved. Is there anyone else I can contact? I would really appreciate it if I could somehow get this 10 GB dataset. I am from Central Europe.

MariaSilverhardt · July 14, 2022, 3:35pm

Hi Jkricej! Sorry to hear that this is not working for you. I have escalated it internally and will have an update for you soon. Thanks for your patience.

jkricej · July 15, 2022, 6:59am

Thank you for not forgetting about my issue.
If you need any help with testing, feel free to contact me.

jkricej · July 18, 2022, 6:48am

The latest situation with pipelines.

MariaSilverhardt · July 20, 2022, 12:34am

Thanks for the update Jkricej! I have passed this along to the team and they are still working on it. Your patience is appreciated, and we hope to have a resolution for you soon!