Getting smaller dataset of TPC-H using pipelines

Is there a chance that the TPC-H sf_10 dataset is available in the same S3 bucket?

```
CREATE OR REPLACE PIPELINE tpch_100_orders
AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/orders/' -- replacing sf_100 with sf_10
CONFIG '{"region":"us-east-1"}'
SKIP DUPLICATE KEY ERRORS
INTO TABLE orders
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '|\n';
```

For instance, sf_10 instead of sf_100. I am interested in pulling a smaller dataset from the S3 bucket. I am using the TPC-H dataset that I got from the official site, and sf_10 is a smaller option. My question is whether there are multiple instances of these pipeline datasets for smaller versions.

Welcome jkricej :wave: Are you on the managed or self-hosted service and what version are you running?

Self-hosted. I am using Cluster in a Box. For my Bachelor's degree I am comparing the TPC-H dataset and its queries between MariaDB (SQL) and SingleStore (NewSQL) to show that SingleStore is a better, newer alternative. Because I have problems generating a 100 GB dataset for MariaDB, I thought I would lower the size to 10 GB, which seems more doable. Therefore, I would need an sf_10 TPC-H dataset in the S3 bucket. Does it exist? Thank you so much for answering.

Hi Jkricej! :sun_with_face:

It is possible to replace the sf_100 with sf_10 to get the dataset at a scale factor of 10 GB. We also have sf_1000 in addition to sf_10 and sf_100.

For example, replace `memsql-tpch-dataset/sf_100/orders/` with `memsql-tpch-dataset/sf_10/orders/`.
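Putting this together, the full pipeline at the smaller scale factor might look like the following (a sketch based on the statement quoted above; the pipeline name `tpch_10_orders` is illustrative, not an official name):

```sql
-- Same pipeline as above, pointed at the sf_10 prefix instead of sf_100
CREATE OR REPLACE PIPELINE tpch_10_orders
AS LOAD DATA S3 'memsql-tpch-dataset/sf_10/orders/'
CONFIG '{"region":"us-east-1"}'
SKIP DUPLICATE KEY ERRORS
INTO TABLE orders
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '|\n';

-- Pipelines do not run until started explicitly
START PIPELINE tpch_10_orders;
```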

Please confirm this resolves your issue. Thank you! Wishing you the best on your class project. :raised_hands:

Sadly, I have already tried this; I saw the different dataset sizes when generating data for MariaDB. I have been running the pipeline with sf_10 instead of sf_100 for a day and a half, and it doesn't seem to load any data (I recreated all the pipelines as you suggested). I will provide a screenshot of the running pipelines and the tpch database. When I started the sf_100 pipeline I got batch reports and the tables started to fill up, but that is not the case when I change the dataset to sf_10.
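When a pipeline appears to run without loading anything, SingleStore's pipeline metadata tables are a good first place to look for stuck batches or access errors (a sketch; these `information_schema` views exist in SingleStore, though the rows you see depend on your cluster):

```sql
-- Errors reported by pipelines, e.g. S3 access or parsing failures
SELECT * FROM information_schema.PIPELINES_ERRORS;

-- Per-batch progress for all pipelines (state, rows, timing)
SELECT * FROM information_schema.PIPELINES_BATCHES_SUMMARY;
```

An empty `PIPELINES_ERRORS` combined with no batches in the summary would suggest the pipeline is finding no files at the given S3 prefix, rather than failing to parse them.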



I would kindly ask for any workaround. I really don't know why the 10 GB dataset is not working. If you need any contact information, please let me know; I am willing to work with your support. Thank you so much for taking care of this and answering so fast.

The issue is sadly still not resolved. Is there anyone else I can contact? I would really appreciate it if I could somehow get this 10 GB dataset. I am from Central Europe.

Hi Jkricej! :sun_with_face: Sorry to hear that this is not working for you. I have escalated it internally and will have an update for you soon. Thanks for your patience.

Thank you for not forgetting about my issue.
If you need any help with testing, feel free to contact me.


The latest situation with pipelines.

Thanks for the update Jkricej! I have passed this along to the team and they are still working on it. Your patience is appreciated, and we hope to have a resolution for you soon! :pray:
