SingleStore Notebooks

Load files from S3 into Shared Tier

This notebook guides you through data ingestion of CSV files from an AWS S3 location into your shared tier workspace.

Create a Pipeline from CSV files in AWS S3

In this example, we create a pipeline that ingests a CSV file stored in an AWS S3 bucket, using stock market data as the working dataset.

Action Required

Make sure to select your database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
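
Before creating any objects, you can optionally confirm which database the notebook is connected to. The cell below is a small sketch of such a check; it simply returns the name of the currently selected database.

%%sql
SELECT DATABASE();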

Create a Table

Start by creating a table to store the ingested data. In our example, we create a stocks table that stores trading data for a specific stock on a given date.

In [1]:

%%sql
CREATE TABLE stocks /* Creating table for sample data. */ (
    `date` date NULL,
    `open` double NULL,
    `high` double NULL,
    `low` double NULL,
    `close` double NULL,
    `volume` bigint(20) NULL,
    `Name` text CHARACTER SET utf8 COLLATE utf8_general_ci NULL,
    SHARD KEY ()
);

Create a pipeline

Next, create a pipeline that points at the S3 bucket containing the dataset. In our case, the CSV file all_stocks_5yr.csv is ingested into the stocks table via a pipeline named stocks_pipeline.

To create your own pipeline, you will need the following information:

  • The path of the bucket, such as: 'helios-self-poc-stockticker/all_stocks_5yr.csv'

  • The name of the bucket’s region, such as: us-east-1

  • Your AWS account’s access credentials: <aws_access_key_id> and <aws_secret_access_key>

For more details on how to retrieve the above information, read our Pipeline Documentation.

In [2]:

%%sql
CREATE OR REPLACE PIPELINE stocks_pipeline /* Creating pipeline for sample data. */
    AS LOAD DATA S3 'helios-self-poc-stockticker/all_stocks_5yr.csv'
    CONFIG '{"region": "us-east-1"}'
    CREDENTIALS '{"aws_access_key_id": "<your_key>",
                  "aws_secret_access_key": "<your_secret>"}'
    BATCH_INTERVAL 45000
    SKIP DUPLICATE KEY ERRORS -- SKIP ALL ERRORS can be used instead; skipped errors can be tracked as described in "Start and monitor the pipeline"
    INTO TABLE stocks
    FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
    LINES TERMINATED BY '\n' STARTING BY ''
    IGNORE 1 LINES
    FORMAT CSV
    (
        `stocks`.`date`,
        `stocks`.`open`,
        `stocks`.`high`,
        `stocks`.`low`,
        `stocks`.`close`,
        `stocks`.`volume`,
        `stocks`.`Name`
    );

Start and monitor the pipeline

The CREATE PIPELINE statement creates a new pipeline, but the pipeline has not yet been started, and no data has been loaded. To start a pipeline in the background, run:

In [3]:

%%sql
START PIPELINE stocks_pipeline;
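
If you prefer to load data synchronously while testing, a pipeline can also be run in the foreground, in which case the statement returns only after the requested batches have been processed. A minimal sketch (optional, not part of the original flow):

%%sql
-- Run the pipeline in the foreground for a single batch instead of starting it in the background
START PIPELINE stocks_pipeline FOREGROUND LIMIT 1 BATCHES;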

In [4]:

%%sql
SHOW PIPELINES;

To check for ingestion errors or warnings, query the information_schema.pipelines_errors view. If the pipeline is running cleanly, the following query returns no rows:

In [5]:

%%sql
SELECT * FROM information_schema.pipelines_errors
    WHERE pipeline_name = 'stocks_pipeline';
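
Beyond the error log, you may also want to see how batches are progressing. The sketch below queries the information_schema.pipelines_batches_summary view, which SingleStore exposes for per-batch pipeline metrics; treat the exact columns as version-dependent and adjust as needed.

%%sql
SELECT *
    FROM information_schema.pipelines_batches_summary
    WHERE pipeline_name = 'stocks_pipeline'
    ORDER BY batch_id DESC
    LIMIT 5;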

Query the table

In [6]:

%%sql
SELECT * FROM stocks LIMIT 5;
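
With data in place, any SQL query against the table works as usual. For instance, a quick illustrative aggregation (not part of the original walkthrough) that reports the average closing price per ticker:

%%sql
SELECT `Name`, AVG(`close`) AS avg_close
    FROM stocks
    GROUP BY `Name`
    ORDER BY avg_close DESC
    LIMIT 10;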

Clean up resources

In [7]:

%%sql
DROP PIPELINE IF EXISTS test.stocks_pipeline;
DROP TABLE IF EXISTS test.stocks;

Details


About this Template

This notebook guides you through data ingestion of CSV files from an AWS S3 location into your shared tier workspace.

This Notebook can be run in Standard and Enterprise deployments.

Tags

pipeline, s3

License

This Notebook has been released under the Apache 2.0 open source license.
