Using MemSQL in AWS Glue (as source and target)

Dear all,

does anybody have any experience with using MemSQL in AWS Glue?
We have tried to read and write data, but we are getting the following error:

“An error occurred while calling o265.pyWriteDynamicFrame. Feature ‘CURSOR’ is not supported by MemSQL.3.1”

Our JDBC string looks like this:
jdbc:mysql://host:3306/database

jdbc:mariadb… does not seem to be supported by AWS Glue.

Hopefully someone has tried this, or is able to help us.

Thanks,
Ömer

Hi Ömer! We are actually working on this right now and very recently (a couple of days ago) validated that it is possible to use our MemSQL Spark Connector directly with AWS Glue. You can get the connector from https://github.com/memsql/memsql-spark-connector. Here are the rough steps to get started:

  • Create an S3 bucket and folder.
  • Add the Spark Connector files to the folder.
  • Create another folder in the same bucket to be used as the Glue temporary directory in later steps.
  • Switch to the AWS Glue Service.
  • Click on Jobs on the left panel under ETL.
  • Provide a name for the job.
  • Select an IAM role. Create a new IAM role if one doesn’t already exist and be sure to add all Glue policies to this role.
  • Select the option for A new script to be authored by you.
  • Give the script a name.
  • Set the temporary directory to the one you created.
  • Expand Script libraries and job parameters.
  • Under the Dependent jars path, add entries for the Spark Connector .jar files.
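For teams that script their infrastructure, the UI steps above can be sketched as a boto3 `create_job` payload. Everything below (bucket, role, and path names) is a placeholder for illustration, not a value from this thread:

```python
# Sketch of the Glue job setup from the steps above, expressed as the payload
# for boto3's glue.create_job. All names here are hypothetical placeholders.
job_definition = {
    "Name": "memsql-etl-job",
    "Role": "GlueServiceRole",  # IAM role with the Glue policies attached
    "Command": {
        "Name": "glueetl",
        "ScriptLocation": "s3://my-glue-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    "DefaultArguments": {
        # Comma-separated jar paths, matching the "Dependent jars path" field
        "--extra-jars": "s3://my-glue-bucket/jars/memsql-spark-connector.jar",
        # The Glue temporary directory created in the earlier step
        "--TempDir": "s3://my-glue-bucket/temp/",
    },
}

# With AWS credentials configured, the job would then be created with:
# import boto3
# boto3.client("glue").create_job(**job_definition)
print(sorted(job_definition["DefaultArguments"]))
```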

Please let us know if you are able to get this working.

Cheers!

Hi Ömer, it seems like you have an old version of MemSQL (3.1). Currently, the latest version of MemSQL is 7.1. Updating to this version may solve your problem.
In addition, you may use the MariaDB driver in AWS Glue by providing a MariaDB jar in the Dependent jars path of your ETL job.

Please let us know if it works for you.
Best Regards!

Hi,
we are actually using the most recent version of MemSQL (7.1).
I believe the error shown is caused by the MySQL driver that was used, or by some function that MemSQL does not support.

The MariaDB driver is not able to connect to the database at all.
We have also received feedback from AWS that native MemSQL support is not planned (as expected, tbh).

We will try to implement carl’s solution and reach out again. Thanks
Kind regards,
Ömer

Hi carl,

thanks for the description. We were not able to get it to work, though, and are now receiving the following error: “Command failed with exit code 1”
Could you provide a more detailed description of what needs to be done, please?
To give you a better understanding of what we want to do:
We would like to run ETL tasks within Glue that let us push and pull data between the various sources/destinations Glue supports (e.g. S3, SQL Server, etc.) and transform data within Glue. Ideally, we would be able to run SQL statements in MemSQL that are triggered from Glue.
(We have read, though, that this is not possible and that we would need to use Scala to run SQL queries on dataframes/datasets that are in Glue, not in MemSQL.)
At the moment we can’t even push data from S3 to MemSQL using Glue (we know this can easily be done using MemSQL Pipelines - our final goal is different here).
If you could provide us any help, that’d be very much appreciated.

Thanks,
Ömer

Hi Ömer,
Can you please provide us with some more information so that we can better understand what the problem is?

  1. Is it possible to get a more informative error?
  2. Is this error thrown while running the ETL job in AWS Glue?
  3. What script are you running inside of the ETL job?

Looking forward to your response.

Best Regards

Hi,

  1. No, the error messages are very cryptic.
  2. Yes, it is thrown after the job is triggered, and the job fails.
  3. Well, we are no Spark experts, so we go with what the newly released AWS Glue Studio creates. Let me provide an example below.

Kind regards,
Ömer

I got rid of most columns:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "glue_test", table_name = "jdbc_pulsedatalake_gluetesttable", transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "glue_test", table_name = "jdbc_pulsedatalake_gluetesttable", transformation_ctx = "DataSource0")
## @type: ApplyMapping
## @args: [mappings = [("leistungszeitraumfrom", "timestamp", "leistungszeitraumfrom", "timestamp")], transformation_ctx = "Transform0"]
## @return: Transform0
## @inputs: [frame = DataSource0]
Transform0 = ApplyMapping.apply(frame = DataSource0, mappings = [("leistungszeitraumfrom", "timestamp", "leistungszeitraumfrom", "timestamp")], transformation_ctx = "Transform0")
## @type: DataSink
## @args: [database = "glue_test", format = "json", table_name = "jdbc_pulsedatalake_gluetesttable", transformation_ctx = "DataSink0"]
## @return: DataSink0
## @inputs: [frame = Transform0]
DataSink0 = glueContext.write_dynamic_frame.from_catalog(frame = Transform0, database = "glue_test", format = "json", table_name = "jdbc_pulsedatalake_gluetesttable", transformation_ctx = "DataSink0")
job.commit()

Hi,
Thanks for the answer.
I can’t see any MemSQL configuration properties in your script (like the host or port of the MemSQL instance, the user, etc.). Could you please describe how you configured MemSQL in your Glue job?

Thanks!

Hi,
that is exactly my point. I guess we are missing or misunderstanding something. Can you provide us some example code, please?
We configured it using the JDBC driver, but I assume this is incorrect when using the MemSQL Spark Connector.
Thanks,
Ömer

Hi,
got it. Please try the following steps:

  1. Download the following jars: memsql-spark-connector, mariadb-connector, and spray-json (an internal dependency), and put them into an S3 bucket.

  2. Edit your Glue job and add the following to the Jar lib path field:
    s3://<your_bucket>/spray-json_2.11-1.3.5.jar,s3://<your_bucket>/memsql-spark-connector_2.11-3.0.4-spark-2.4.4.jar,s3://<your_bucket>/mariadb-java-client-2.7.0.jar

  3. Edit your script in such way:

     options = {
         "ddlEndpoint" : "<memsql_host>:<memsql_port>",
         "database" : "<memsql_database>",
         "user" : "<memsql_user>",
         "password" : "<memsql_password>",
         "dataSource" : "memsql",
         "dbtable" : "<memsql_table>",
         "className" : "com.memsql.spark"
     }
     DataSource0 = glueContext.create_dynamic_frame_from_options(
         connection_type = "marketplace.spark",
         connection_options = options,
         transformation_ctx = "datasource")
     Transform0 = ApplyMapping.apply(frame = DataSource0, mappings = [("leistungszeitraumfrom", "timestamp", "leistungszeitraumfrom", "timestamp")], transformation_ctx = "Transform0")
     DataSink0 = glueContext.write_dynamic_frame.from_options(
         frame = Transform0,
         connection_type = "marketplace.spark",
         connection_options = options)
    
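Since a typo or a missing key in this options dict only surfaces once the job actually runs in Glue, a small local pre-check can save a debugging round trip. A minimal sketch, assuming only the keys used in the example above (the real connector may accept more options than these):

```python
# Local sanity check for the connector options shown above, so a missing key
# fails fast instead of deep inside a Glue run. The key list mirrors the
# example in this thread; it is not an exhaustive connector specification.
REQUIRED_KEYS = {"ddlEndpoint", "database", "user", "password",
                 "dataSource", "dbtable", "className"}

def check_options(options):
    """Return the set of missing required keys (empty set = looks complete)."""
    return REQUIRED_KEYS - options.keys()

# An incomplete dict is flagged before the job is ever submitted:
incomplete = {"ddlEndpoint": "host:3306", "database": "db"}
print(sorted(check_options(incomplete)))
```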

Please let us know if it works for you.

Best Regards

Hi,

we have tested the steps you provided.

  1. We created a new job with the following script:

https://pastebin.com/KbU0jF1s (pasting it here did not work for some reason)

  2. The job details are as such: (screenshot not shown)

  3. However, we get the following error: (screenshot not shown)

  4. The error log shows this: (screenshot not shown)

We weren’t really able to understand what we did wrong here. We assume that “java.io.FileNotFoundException: File --spark-event-logs-path does not exist” points toward the issue, but we did not know what to check.

Any help would be much appreciated.

Thanks,
Ömer

Hi,
not sure that this is the cause, but could you check whether your spark-event-log-path exists? You run your job with this job parameter, and it seems it couldn’t be found.
If the path does exist, could you please also try running your job without this parameter?

Best Regards

Hi,

I have removed the Spark UI logging as suggested, although the bucket did exist. I’m not sure, but I believe we do not need it.

I have updated the pastebin link above, as the script was cut off after 24 lines.

The error we now get is as follows: (screenshot not shown)

Thanks,
Ömer

Hi,
This error says that you have invalid syntax on line 22. Could you please check it?

Best Regards

Hi,

line 22 is the following one:

"dataSource" : "memsql",

Do we need to adjust this?

Kind regards,
Ömer

Hi,
it’s hard to say because I don’t see your whole code, but I suppose the indentation before this line may be wrong, since it’s Python and parsing depends on consistent indentation.
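One more thing worth checking: in Python, a SyntaxError is often reported on the line after the real mistake, for example when a comma is missing between two dict entries. That may be why line 22 looks innocent while the actual problem sits on line 21. A small illustrative reproduction (not the actual job script):

```python
# The comma is missing after "secret"; the parser only stumbles when it
# reaches the next entry, so the reported line number points at or near the
# following line, not the line where the comma was omitted. (The exact line
# reported can vary between Python versions.)
source = '''options = {
    "password": "secret"
    "dataSource": "memsql",
}'''
try:
    compile(source, "<job-script>", "exec")
    reported = None
except SyntaxError as err:
    reported = err.lineno
print(reported)
```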

Best Regards