Using MemSQL in AWS Glue (as source and target)

Dear all,

does anybody have any experience with using MemSQL in AWS Glue?
We have tried to read and write data, but we are getting the following error:

“An error occurred while calling o265.pyWriteDynamicFrame. Feature ‘CURSOR’ is not supported by MemSQL.3.1”

Our JDBC string looks like this:
jdbc:mysql://host:3306/database

jdbc:mariadb… does not seem to be supported by AWS Glue.

Hopefully someone has tried this, or is able to help us.

Thanks,
Ömer

Hi Ömer! We are actually working on this right now and very recently (a couple of days ago) validated that it is possible to use our MemSQL Spark Connector directly with AWS Glue. You can get the connector from https://github.com/memsql/memsql-spark-connector. Here are the rough steps to get started:

  • Create an S3 bucket and folder.
  • Add the Spark Connector files to the folder.
  • Create another folder in the same bucket to be used as the Glue temporary directory in later steps.
  • Switch to the AWS Glue Service.
  • Click on Jobs on the left panel under ETL.
  • Provide a name for the job.
  • Select an IAM role. Create a new IAM role if one doesn’t already exist and be sure to add all Glue policies to this role.
  • Select the option for A new script to be authored by you.
  • Give the script a name.
  • Set the temporary directory to the one you created.
  • Expand Script libraries and job parameters.
  • Under the Dependent jars path, add entries for the Spark Connector .jar files.
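For teams that script their infrastructure, the UI steps above can be sketched as a boto3 `create_job` payload. Everything below (bucket, role, and path names) is a placeholder for illustration, not a value from this thread:

```python
# Sketch of the Glue job setup from the steps above, expressed as the payload
# for boto3's glue.create_job. All names here are hypothetical placeholders.
job_definition = {
    "Name": "memsql-etl-job",
    "Role": "GlueServiceRole",  # IAM role with the Glue policies attached
    "Command": {
        "Name": "glueetl",
        "ScriptLocation": "s3://my-glue-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    "DefaultArguments": {
        # Comma-separated jar paths, matching the "Dependent jars path" field
        "--extra-jars": "s3://my-glue-bucket/jars/memsql-spark-connector.jar",
        # The Glue temporary directory created in the earlier step
        "--TempDir": "s3://my-glue-bucket/temp/",
    },
}

# With AWS credentials configured, the job would then be created with:
# import boto3
# boto3.client("glue").create_job(**job_definition)
print(sorted(job_definition["DefaultArguments"]))
```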

Please let us know if you are able to get this working.

Cheers!

Hi Ömer, it seems like you have an old version of MemSQL (3.1). Currently, the latest version of MemSQL is 7.1. Updating to this version may solve your problem.
In addition, you may use the MariaDB driver in AWS Glue by providing a MariaDB jar in the Dependent jars path of your ETL job.

Please let us know if it works for you.
Best Regards!

Hi,
we are actually using the most recent version of MemSQL (7.1).
I believe the error shown is caused by the MySQL driver that was used, or by some function that MemSQL does not support.

The MariaDB driver is not able to connect to the database at all.
We have also received feedback from AWS that native MemSQL support is not planned (as expected, tbh).

We will try to implement carl’s solution and reach out again. Thanks
Kind regards,
Ömer

Hi carl,

thanks for the description. We were not able to get it to work, though, and are now receiving the following error: “Command failed with exit code 1”
Could you provide a more detailed description of what needs to be done, please?
To give you a better understanding of what we want to do:
We would like to run ETL tasks within Glue that let us push and pull data between the various sources/destinations Glue supports (e.g. S3, SQL Server, etc.) and transform data within Glue. Ideally, we would be able to run SQL statements in MemSQL that are triggered from Glue.
(We have read, though, that this is not possible and that we would need to use Scala to run SQL queries on dataframes/datasets that are in Glue, not in MemSQL.)
At the moment we can’t even push data from S3 to MemSQL using Glue (we know this can easily be done using MemSQL Pipelines - our final goal is different here).
If you could provide us any help, that’d be very much appreciated.

Thanks,
Ömer

Hi Ömer,
Can you please provide us with some more information so that we can better understand what the problem is?

  1. Is it possible to get a more informative error?
  2. Is this error thrown while running the ETL job in AWS Glue?
  3. What script are you running inside of the ETL job?

Looking forward to your response.

Best Regards

Hi,

  1. No, the error messages are very cryptic.
  2. Yes, it is thrown after the job is triggered, and the job fails.
  3. Well, we are no Spark experts, so we go with what the newly released AWS Glue Studio creates. Let me provide an example below.

Kind regards,
Ömer

I got rid of most columns:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "glue_test", table_name = "jdbc_pulsedatalake_gluetesttable", transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "glue_test", table_name = "jdbc_pulsedatalake_gluetesttable", transformation_ctx = "DataSource0")
## @type: ApplyMapping
## @args: [mappings = [("leistungszeitraumfrom", "timestamp", "leistungszeitraumfrom", "timestamp")], transformation_ctx = "Transform0"]
## @return: Transform0
## @inputs: [frame = DataSource0]
Transform0 = ApplyMapping.apply(frame = DataSource0, mappings = [("leistungszeitraumfrom", "timestamp", "leistungszeitraumfrom", "timestamp")], transformation_ctx = "Transform0")
## @type: DataSink
## @args: [database = "glue_test", format = "json", table_name = "jdbc_pulsedatalake_gluetesttable", transformation_ctx = "DataSink0"]
## @return: DataSink0
## @inputs: [frame = Transform0]
DataSink0 = glueContext.write_dynamic_frame.from_catalog(frame = Transform0, database = "glue_test", format = "json", table_name = "jdbc_pulsedatalake_gluetesttable", transformation_ctx = "DataSink0")
job.commit()

Hi,
Thanks for the answer.
I can’t see any MemSQL configuration properties in your script (like the host or port of the MemSQL instance, the user, etc.). Could you please describe how you configured MemSQL in your Glue job?

Thanks!

Hi,
that is exactly my point. I guess we are missing or misunderstanding something. Can you provide us some example code, please?
We configured it using the JDBC driver, but I assume this is incorrect when using the MemSQL Spark Connector.
Thanks,
Ömer

Hi,
got it. Please try the following steps:

  1. Download the following jars: memsql-spark-connector, mariadb-connector, and spray-json (an internal dependency), and put them into an S3 bucket.

  2. Edit your Glue job and add the following to the Jar lib path field:
    s3://<your_bucket>/spray-json_2.11-1.3.5.jar,s3://<your_bucket>/memsql-spark-connector_2.11-3.0.4-spark-2.4.4.jar,s3://<your_bucket>/mariadb-java-client-2.7.0.jar

  3. Edit your script in such way:

     options = {
         "ddlEndpoint" : "<memsql_host>:<memsql_port>",
         "database" : "<memsql_database>",
         "user" : "<memsql_user>",
         "password" : "<memsql_password>",
         "dataSource" : "memsql",
         "dbtable" : "<memsql_table>",
         "className" : "com.memsql.spark"
     }
     DataSource0 = glueContext.create_dynamic_frame_from_options(
         connection_type = "marketplace.spark",
         connection_options = options,
         transformation_ctx = "datasource")
     Transform0 = ApplyMapping.apply(frame = DataSource0, mappings = [("leistungszeitraumfrom", "timestamp", "leistungszeitraumfrom", "timestamp")], transformation_ctx = "Transform0")
     DataSink0 = glueContext.write_dynamic_frame.from_options(
         frame = Transform0,
         connection_type = "marketplace.spark",
         connection_options = options)
    
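Since a typo or a missing key in this options dict only surfaces once the job actually runs in Glue, a small local pre-check can save a debugging round trip. A minimal sketch, assuming only the keys used in the example above (the real connector may accept more options than these):

```python
# Local sanity check for the connector options shown above, so a missing key
# fails fast instead of deep inside a Glue run. The key list mirrors the
# example in this thread; it is not an exhaustive connector specification.
REQUIRED_KEYS = {"ddlEndpoint", "database", "user", "password",
                 "dataSource", "dbtable", "className"}

def check_options(options):
    """Return the set of missing required keys (empty set = looks complete)."""
    return REQUIRED_KEYS - options.keys()

# An incomplete dict is flagged before the job is ever submitted:
incomplete = {"ddlEndpoint": "host:3306", "database": "db"}
print(sorted(check_options(incomplete)))
```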

Please let us know if it works for you.

Best Regards

Hi,

we have tested the steps you provided.

  1. We created a new job with the following script:

https://pastebin.com/KbU0jF1s (pasting it here did not work for some reason)

  2. The job details are as such: (screenshot not shown)

  3. However, we get the following error: (screenshot not shown)

  4. The error log shows this: (screenshot not shown)

We weren’t really able to understand what we did wrong here. We assume that “java.io.FileNotFoundException: File --spark-event-logs-path does not exist” points toward the issue, but we did not know what to check.

Any help would be much appreciated.

Thanks,
Ömer

Hi,
not sure that this is the cause, but could you check whether your spark-event-log-path exists? You run your job with this job parameter, and it seems it couldn’t be found.
If the path does exist, could you please also try running your job without this parameter?

Best Regards

Hi,

I have removed the Spark UI logging as suggested, although the bucket did exist. I’m not sure, but I believe we do not need it.

I have updated the pastebin link above, as the script was cut off after 24 lines.

The error we now get is as follows: (screenshot not shown)

Thanks,
Ömer

Hi,
This error says that you have invalid syntax on line 22. Could you please check it?

Best Regards

Hi,

line 22 is the following one:

"dataSource" : "memsql",

Do we need to adjust this?

Kind regards,
Ömer

Hi,
it’s hard to say because I don’t see your whole code, but I suppose the indentation before this line may be wrong, since it’s Python and parsing depends on consistent indentation.
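One more thing worth checking: in Python, a SyntaxError is often reported on the line after the real mistake, for example when a comma is missing between two dict entries. That may be why line 22 looks innocent while the actual problem sits on line 21. A small illustrative reproduction (not the actual job script):

```python
# The comma is missing after "secret"; the parser only stumbles when it
# reaches the next entry, so the reported line number points at or near the
# following line, not the line where the comma was omitted. (The exact line
# reported can vary between Python versions.)
source = '''options = {
    "password": "secret"
    "dataSource": "memsql",
}'''
try:
    compile(source, "<job-script>", "exec")
    reported = None
except SyntaxError as err:
    reported = err.lineno
print(reported)
```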

Best Regards