New

Load files from S3 into Shared Tier

Notebook

SingleStore Notebooks

Load files from S3 into Shared Tier

This notebook guides you through data ingestion of CSV files from an AWS S3 location into your shared tier workspace.

Create a Pipeline from CSV files in AWS S3

In this example, we want to create a pipeline that ingests from a CSV file stored in an AWS S3 bucket. We will guide you through an example with stock market data.

Action Required

Make sure to select your database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

Create a Table

Start by creating a table to store the ingested data. In our example, we will create a Stocks table that will store trading data for a specific stock on a given date.

In [1]:

1
%%sql
2

3
CREATE TABLE stocks /* Creating table for sample data. */(
4
	`date` date NULL,
5
	`open` double NULL,
6
	`high` double NULL,
7
	`low` double NULL,
8
	`close` double NULL,
9
	`volume` bigint(20) NULL,
10
	`Name` text CHARACTER SET utf8 COLLATE utf8_general_ci NULL,
11
	 SHARD KEY ()
12
);

Create a pipeline

We then create a pipeline by pointing the data source to the S3 bucket containing the dataset. In our case, we have a CSV file all_stocks_5yr.csv and will be ingesting it into our Stocks table via a pipeline stocks_pipeline.

To create your own pipeline, you will need the following information:

The path of the bucket, such as: 'helios-self-poc-stockticker/all_stocks_5yr.csv'
The name of the bucket’s region, such as: us-east-1
Your AWS account’s access credentials: <aws_access_key_id> and <aws_secret_access_key>

*For more on how to retrieve the above information, read our Pipeline Documentation.

In [2]:

%%sql

CREATE OR REPLACE PIPELINE stocks_pipeline /* Creating pipeline for sample data. */

AS LOAD DATA S3 'helios-self-poc-stockticker/all_stocks_5yr.csv'

CONFIG '{"region": "us-east-1"}'

CREDENTIALS '{"aws_access_key_id": "<your_key>",

"aws_secret_access_key": "<your_secret>"}'

BATCH_INTERVAL 45000

SKIP DUPLICATE KEY ERRORS -- SKIP ALL ERRORS can be used to skip all errors that can be tracked through "Monitor the pipeline for errors"

INTO TABLE stocks

FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'

LINES TERMINATED BY '\n' STARTING BY ''

IGNORE 1 LINES

FORMAT CSV

(

`stocks`.`date`,

`stocks`.`open`,

`stocks`.`high`,

`stocks`.`low`,

`stocks`.`close`,

`stocks`.`volume`,

`stocks`.`Name`

);

Start and monitor the pipeline

The CREATE PIPELINE statement creates a new pipeline, but the pipeline has not yet been started, and no data has been loaded. To start a pipeline in the background, run:

In [3]:

%%sql

START PIPELINE stocks_pipeline;

In [4]:

1
%%sql
2
show pipelines;

If there is no error or warning, you should see no error message.

In [5]:

1
%%sql
2

3
SELECT * FROM information_schema.pipelines_errors
4
    WHERE pipeline_name = 'stocks_pipeline';

Query the table

In [6]:

1
%%sql
2

3
SELECT * FROM stocks LIMIT 5;

Cleanup resources

In [7]:

%%sql

DROP PIPELINE IF EXISTS test.stocks_pipeline;

DROP TABLE IF EXISTS test.stocks;

Details

About this Template

This notebook guides you through data ingestion of CSV files from an AWS S3 location into your shared tier workspace.

This Notebook can be run in Standard and Enterprise deployments.

License

This Notebook has been released under the Apache 2.0 open source license.

See Notebook in action

Launch this notebook in SingleStore and start executing queries instantly.

Load files from S3 into Shared Tier

Notebook

Load files from S3 into Shared Tier

Create a Pipeline from CSV files in AWS S3

Create a Table

Create a pipeline

Start and monitor the pipeline

Query the table

Cleanup resources

Details

About this Template

Tags

License

See Notebook in action