Can't Connect with Spark Connector

Hi There,

I am trying to connect to SingleStore from Apache Spark in an offline environment, but I am getting the following error:

java.lang.ClassNotFoundException: Failed to find data source: singlestore. Please find packages at http://spark.apache.org/third-party-projects.html

Since I am working in an offline environment, I can't use the --packages option, which points to Maven coordinates in an online repository, so instead I am listing the jars with the --jars parameter.
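
For reference, if I were online I believe I could pull the connector straight from Maven with --packages instead, something like the following (coordinates are from memory, so they may be slightly off):

# not usable offline; coordinates below are approximate
bin/spark-shell --packages com.memsql:memsql-spark-connector_2.11:3.0.5-spark-2.4.4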

I have tried following the instructions here and here but I am still running into this error.

I have pasted the spark-shell session below, which shows that the jars are registered with the Spark session and that the MSSQL JDBC jar can be used successfully. Any ideas on what I am doing wrong? Thanks in advance for any help:

[root@hadoop-namenode spark-2.4.5-bin-hadoop2.7]# bin/spark-shell --jars /lib/jars/sqljdbc_6.0/enu/jre8/sqljdbc42.jar,/lib/jars/memsql-spark-connector_2.11-3.0.5-spark-2.4.4.jar,/lib/jars/mariadb-java-client-2.7.2.jar
21/03/26 05:16:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/03/26 05:16:16 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark context Web UI available at http://hadoop-namenode:4041
Spark context available as 'sc' (master = local[*], app id = local-1616728576328).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sparkContext.listJars.foreach(println)
spark://hadoop-namenode:52984/jars/mariadb-java-client-2.7.2.jar
spark://hadoop-namenode:52984/jars/sqljdbc42.jar
spark://hadoop-namenode:52984/jars/memsql-spark-connector_2.11-3.0.5-spark-2.4.4.jar

scala> :paste
// Entering paste mode (ctrl-D to finish)

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://xxx.xxx.x.xx:1434")
  .option("databasename", "xxxxx")
  .option("dbtable", "xxxxx")
  .option("user", "xxxxxxxxx")
  .option("password", "xxxxxxxxxxx")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .load

println("Test Number of Rows: " + df.count)

// Exiting paste mode, now interpreting.

21/03/26 05:18:50 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
Test Number of Rows: 0
df: org.apache.spark.sql.DataFrame = [x,x,x,x: int … 75 more fields]

scala> :paste
// Entering paste mode (ctrl-D to finish)

spark.conf.set("spark.datasource.singlestore.ddlEndpoint", "xxx.xx.x.xx")
spark.conf.set("spark.datasource.singlestore.user", "root")
spark.conf.set("spark.datasource.singlestore.password", "xxxxxxxxx")

val df = spark.read
  .format("singlestore")
  .load("test.cust")

// Exiting paste mode, now interpreting.

java.lang.ClassNotFoundException: Failed to find data source: singlestore. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
… 57 elided
Caused by: java.lang.ClassNotFoundException: singlestore.DefaultSource
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
… 59 more

Hi @benji,
You're using version 3.0.5 of the spark connector, which was released back when we were still called MemSQL. The singlestore format was only added in version 3.0.6 (the latest version is currently 3.0.7).
So you should either upgrade to spark-connector version >= 3.0.6 or use memsql as the format, e.g.

val df = spark.read
  .format("memsql")
  .load("test.cust")
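
One more thing worth checking if you stay on 3.0.5: as far as I remember, that version also reads its options from the memsql namespace rather than singlestore, so the conf keys would look roughly like this (endpoint and credentials are the placeholders from your example):

// 3.0.5 (MemSQL-era) connector: options are expected under spark.datasource.memsql.*
spark.conf.set("spark.datasource.memsql.ddlEndpoint", "xxx.xx.x.xx")
spark.conf.set("spark.datasource.memsql.user", "root")
spark.conf.set("spark.datasource.memsql.password", "xxxxxxxxx")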