Parallel Read using SingleStore Spark Connector

sairohith2012 · June 8, 2022, 10:18am

I am exploring options for parallel reads with the SingleStore Spark connector. When I use the forced option, the Spark job frequently fails because it starts before the minimum number of executors needed for the parallel read is available. I have therefore switched to automatic and am using the parallel read feature readFromAggregatorsMaterialized.
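For readers following along, the setup described above might be expressed roughly as below. This is only a sketch: the option keys are the ones I understand the connector to document under the `spark.datasource.singlestore.` prefix, the helper `parallel_read_options` is a hypothetical name used for illustration, and the exact keys should be checked against the connector version in use.

```python
# Sketch only: builds the connector options discussed in this post as a plain dict.
# Assumption: keys follow the connector's "spark.datasource.singlestore." prefix.
PREFIX = "spark.datasource.singlestore."


def parallel_read_options(mode: str, features: str) -> dict:
    """Return Spark config entries for a parallel read (hypothetical helper)."""
    return {
        PREFIX + "enableParallelRead": mode,         # "automatic" or "forced" in this thread
        PREFIX + "parallelRead.Features": features,  # comma-separated feature list
    }


# The configuration described in the post: automatic mode, materialized reads.
opts = parallel_read_options("automatic", "readFromAggregatorsMaterialized")

for key, value in opts.items():
    print(f"{key}={value}")
```

Each entry could then be passed to `SparkSession.builder.config(key, value)` when the session is created.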

Does it store the output of the query (30 million records for now, which might grow exponentially in the future) in memory?

I need your help to understand the impact this feature will have on the database. I am using a shared cluster and I don’t want to affect other jobs.

FYI: The query fetches all the records in the table and doesn’t include nested queries.
TIA

jprice · June 8, 2022, 8:27pm

Hi @sairohith2012,

Welcome to SingleStore forums! Thanks for your post.

Using readFromAggregatorsMaterialized stores the output of the query in memory on the SingleStore cluster, which is a concern with a dataset this large.

How many partitions does your SingleStore cluster have and how many executor CPUs (spark.executor.cores/spark.task.cpus) are available in Spark?

Finally, are you running SingleStoreDB Cloud or on-prem?

Cheers,
Julie

sairohith2012 · June 9, 2022, 4:06am

Hi Jprice,

Thanks for your reply. I am currently working on a database with 16 partitions, but this will increase to 112.

The problem with the Spark executors is that the job starts at partial capacity, before it has acquired all 16 cores, and this causes the parallel read to fail. I have currently set the number of executor instances to 5 and the cores per executor to 5, so a total of 25 cores is available.
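To make the capacity arithmetic above concrete, here is a small sketch. It assumes, as described in this thread, that a non-materialized parallel read needs at least one concurrently running task per database partition, and that task slots are computed as executor instances times (spark.executor.cores / spark.task.cpus). The function names are hypothetical and `spark.task.cpus` is assumed to be 1.

```python
import math


def task_slots(executor_instances: int, executor_cores: int, task_cpus: int = 1) -> int:
    """Concurrent task slots once every requested executor has registered."""
    return executor_instances * (executor_cores // task_cpus)


def min_executors(partitions: int, executor_cores: int, task_cpus: int = 1) -> int:
    """Fewest executors that give one task slot per database partition."""
    return math.ceil(partitions / (executor_cores // task_cpus))


# Figures from this thread: 5 executors with 5 cores each; 16 partitions now, 112 later.
slots = task_slots(executor_instances=5, executor_cores=5)
for partitions in (16, 112):
    print(partitions, slots >= partitions, min_executors(partitions, executor_cores=5))
```

With these figures, 25 slots cover 16 partitions once at least 4 of the 5 executors are up, but 112 partitions would need 23 executors of this size.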

Thanks,
Rohith