Issue while executing Spark Job from AWS EMR

Hi Team,

We are facing an issue while loading data from AWS EMR/Spark into MemSQL. We tried a MemSQL Pipeline with the proper privileges granted to a normal (non-root) user: CREATE PIPELINE, DROP PIPELINE, START PIPELINE, and STOP PIPELINE. We are still unable to load data.

We also tried the MemSQL Spark Connector to load the data, but we face the same issue. The actual error we got is:

Access Denied for user <user_id> to database <db_name>

If we run the job using the MemSQL root credentials, it works fine.

Can anyone suggest which privileges we are missing?

Thanks in advance.

Prashant

Unlike a typical application, Spark can sometimes query leaf nodes in a MemSQL cluster directly. If I had to guess, the leaf nodes in your cluster don’t have the user that Spark is using created on them (the user likely exists only on the aggregators, since those are typically the only nodes that need it).
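If that’s the case, a quick way to verify and fix it is to create the same user on each leaf. A sketch only; the user name, host, password, and database below are placeholders, and the syntax assumes a recent MemSQL release:

```sql
-- Run against each leaf node directly (connect to the leaf's host/port):
CREATE USER 'spark_user'@'%' IDENTIFIED BY 'your_password';  -- placeholders
GRANT SELECT, INSERT ON db_name.* TO 'spark_user'@'%';
```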

Thanks for the reply,

We have added the user to the leaf nodes and granted it the same privileges as on the master aggregator.
The privileges assigned to the MemSQL user are:

SELECT, INSERT, UPDATE, DELETE, CREATE, CREATE TEMPORARY TABLES, LOCK TABLES, SHOW VIEW, CREATE PIPELINE, DROP PIPELINE, START PIPELINE, ALTER PIPELINE, SHOW PIPELINE

We are still getting the following error while running the Scala/Spark job:

20/05/28 09:31:48 ERROR ApplicationMaster: User class threw exception: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Access denied for user 'user_id'@'%' to database 'db_name'

This job is a truncate-and-load, so after the truncate we tried insert as well as append instead of overwrite, but we still face the same issue.

Thanks in advance.
Prashant

Hello,

I’ve linked MemSQL’s permissions matrix here, which provides all the permissions you need for a given operation.

That said, the permissions you listed above look sufficient for loading data; however, depending on exactly what you are doing in the statement, you may require further permissions:

Can you provide:

  1. Both the pipeline load statement and the Spark write statement you used that led to the access-denied error?
  2. The MemSQL version of your cluster?

Furthermore, regarding Spark:

  • Can you confirm which version of the Spark Connector you are using? We strongly recommend using the latest version, MemSQL Spark Connector 3.0
  • I’ve linked the README which also refers to the permissions required
  • Note that if you are truncating a table using Overwrite mode, you require SELECT, INSERT, and DROP. If you are dropping/recreating a table or creating a new table from scratch, you need CREATE as well. Appending only requires the SELECT and INSERT permissions.
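To make that concrete, the two modes map onto the DataFrameWriter like this (a sketch only; `df`, the database, and the table name are placeholders, and the format string matches the 2.0 connector you are using):

```scala
// Overwrite: drops/truncates and rewrites the table
//   -> requires SELECT, INSERT, DROP (plus CREATE if the table is recreated)
df.write.format("com.memsql.spark.connector").mode("overwrite").save("db_name.target_table")

// Append: inserts new rows only -> requires just SELECT and INSERT
df.write.format("com.memsql.spark.connector").mode("append").save("db_name.target_table")
```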

As Adam from our team noted above, if you are using parallel read/partition pushdown in the Spark Connector, the user writing from Spark also requires access to the MemSQL leaves.

Best,
Roxanna

Hi Roxanna

Thanks for your prompt reply.

As mentioned earlier, I created user_id on all nodes in the cluster with the aforementioned privileges, assuming Spark will execute parallel read/partition pushdown.

Here is the write statement of the Spark job:

statement.executeUpdate(s"delete from $target_orc where $src_filter")
target.write.format("com.memsql.spark.connector").mode("append").save(target_memsql_table)

MemSQL version is 7.0.11
MemSQL Spark Connector version is 2.0.1

We are testing with version 3.0 as mentioned in your reply and will let you know.

Thanks in advance.
Prashant

Hi Roxanna,

The current issue is solved by adding the privileges CREATE DATABASE (although the permissions matrix says CREATE applies to databases and tables, we had to specify the DATABASE keyword explicitly) and SHOW METADATA. The job is working fine with connector v2.0.1.
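For reference, the additional grants were along these lines (user and host shown as placeholders):

```sql
GRANT CREATE DATABASE ON *.* TO 'user_id'@'%';
GRANT SHOW METADATA ON *.* TO 'user_id'@'%';
```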

Since the job is in the POC phase, do you foresee any further issues that may arise? I have asked the development team to go through the GitHub link you provided before actual deployment.

Thanks and Regards
Prashant

Hi Prashant,

Thank you for confirming the cause of the issue around database-creation permissions. If you are creating a database (and not just a table) in the Spark load command, you would require the CREATE DATABASE permission, as you mentioned.

Spark 3.0 Connector

As noted in previous replies, we strongly recommend using the 3.0 connector, especially if you are in the POC phase of your development; we will be deprecating the 2.0 version of the connector in the medium-term.

Not only does the 3.0 connector support robust SQL pushdown (the 2.0 connector only supports it for a few query shapes), but it is fully compatible with JDBC’s API. Additionally, the 3.0 connector is compatible with the latest versions of Spark (2.3 and 2.4) and with MemSQL 7.0+ (the 2.0 connector is only supported with pre-7.0 versions of MemSQL).
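As a rough illustration of what moving to the 3.0 connector looks like, here is a sketch based on the README; the endpoint, credentials, database, and table names are placeholders, and `df` is assumed to be an existing DataFrame:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of connector 3.0 usage: configuration moves into Spark options
// rather than a custom format class. All values below are placeholders.
val spark = SparkSession.builder()
  .config("spark.datasource.memsql.ddlEndpoint", "master-aggregator-host:3306")
  .config("spark.datasource.memsql.user", "user_id")
  .config("spark.datasource.memsql.password", "********")
  .getOrCreate()

// Writes go through the short "memsql" format name instead of
// "com.memsql.spark.connector"; reads get automatic SQL pushdown.
df.write.format("memsql").mode("append").save("db_name.target_table")
```

See the linked README for the full option list (e.g. dmlEndpoint for parallel writes to the leaves).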

In your other forum post, you indicated that you were an enterprise customer of MemSQL. Would you mind sharing which company you are a part of? We would be happy to discuss the Spark Connector 3.0 benefits with you over a call with a product manager.

Permission Matrix Documentation

Regarding your comment here:

(although permission matrix says CREATE applies to database and tables, we explicitly mentioned keyword database)

Can you confirm in the permissions matrix where you see database creation is allowed under ‘CREATE’ permissions? We can make sure that it is clarified in our documentation.

Note that the field “Allowed Scopes” in the permissions matrix table is indicating that the CREATE permission can be granted for tables inside all databases within a cluster, to a specific database within the cluster, or to a specific table in a database. It is not indicating that the grant allows creating databases.

Additionally, you can see in this linked table that the minimum permission for creating a database is the create database permission.

Thank you!

Best,
Roxanna