New

Load files from S3 into Shared Tier

Notebook


SingleStore Notebooks

Load files from S3 into Shared Tier

This notebook guides you through data ingestion of CSV files from an AWS S3 location into your shared tier workspace.

Create a Pipeline from CSV files in AWS S3

In this example, we want to create a pipeline that ingests from a CSV file stored in an AWS S3 bucket. We will guide you through an example with stock market data.

Action Required

Make sure to select your database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

Create a Table

Start by creating a table to store the ingested data. In our example, we will create a Stocks table that will store trading data for a specific stock on a given date.

In [1]:

%%sql
CREATE TABLE stocks /* Creating table for sample data. */(
`date` date NULL,
`open` double NULL,
`high` double NULL,
`low` double NULL,
`close` double NULL,
`volume` bigint(20) NULL,
`Name` text CHARACTER SET utf8 COLLATE utf8_general_ci NULL,
SHARD KEY ()
);

Create a pipeline

We then create a pipeline by pointing the data source to the S3 bucket containing the dataset. In our case, we have a CSV file all_stocks_5yr.csv and will be ingesting it into our Stocks table via a pipeline stocks_pipeline.

To create your own pipeline, you will need the following information:

  • The path of the bucket, such as: 'helios-self-poc-stockticker/all_stocks_5yr.csv'

  • The name of the bucket’s region, such as: us-east-1

  • Your AWS account’s access credentials: <aws_access_key_id> and <aws_secret_access_key>

*For more on how to retrieve the above information, read our Pipeline Documentation.

In [2]:

%%sql
CREATE OR REPLACE PIPELINE stocks_pipeline /* Creating pipeline for sample data. */
AS LOAD DATA S3 'helios-self-poc-stockticker/all_stocks_5yr.csv'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "<your_key>",
"aws_secret_access_key": "<your_secret>"}'
BATCH_INTERVAL 45000
SKIP DUPLICATE KEY ERRORS -- SKIP ALL ERRORS can be used to skip all errors that can be tracked through "Monitor the pipeline for errors"
INTO TABLE stocks
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n' STARTING BY ''
IGNORE 1 LINES
FORMAT CSV
(
`stocks`.`date`,
`stocks`.`open`,
`stocks`.`high`,
`stocks`.`low`,
`stocks`.`close`,
`stocks`.`volume`,
`stocks`.`Name`
);

Start and monitor the pipeline

The CREATE PIPELINE statement creates a new pipeline, but the pipeline has not yet been started, and no data has been loaded. To start a pipeline in the background, run:

In [3]:

%%sql
START PIPELINE stocks_pipeline;

In [4]:

%%sql
show pipelines;

If there is no error or warning, you should see no error message.

In [5]:

%%sql
SELECT * FROM information_schema.pipelines_errors
WHERE pipeline_name = 'stocks_pipeline';

Query the table

In [6]:

%%sql
SELECT * FROM stocks LIMIT 5;

Cleanup resources

In [7]:

%%sql
DROP PIPELINE IF EXISTS test.stocks_pipeline;
DROP TABLE IF EXISTS test.stocks;

Details


About this Template

This notebook guides you through data ingestion of CSV files from an AWS S3 location into your shared tier workspace.

Notebook Icon

This Notebook can be run in Standard and Enterprise deployments.

Tags

pipelines3

License

This Notebook has been released under the Apache 2.0 open source license.