Load files from S3 into Shared Tier
Notebook
This notebook guides you through data ingestion of CSV files from an AWS S3 location into your shared tier workspace.
Create a Pipeline from CSV files in AWS S3
In this example, we want to create a pipeline that ingests from a CSV file stored in an AWS S3 bucket. We will guide you through an example with stock market data.
Action Required
Make sure to select your database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
Create a Table
Start by creating a table to store the ingested data. In our example, we will create a Stocks
table that will store trading data for a specific stock on a given date.
In [1]:
%%sqlCREATE TABLE stocks (`date` date NULL,`open` double NULL,`high` double NULL,`low` double NULL,`close` double NULL,`volume` bigint(20) NULL,`Name` text CHARACTER SET utf8 COLLATE utf8_general_ci NULL,SHARD KEY ());
Create a pipeline
We then create a pipeline by pointing the data source to the S3 bucket containing the dataset. In our case, we have a CSV file all_stocks_5yr.csv
and will be ingesting it into our Stocks
table via a pipeline stocks_pipeline
.
To create your own pipeline, you will need the following information:
The path of the bucket, such as:
'helios-self-poc-stockticker/all_stocks_5yr.csv'
The name of the bucket’s region, such as:
us-east-1
Your AWS account’s access credentials:
<aws_access_key_id>
and<aws_secret_access_key>
*For more on how to retrieve the above information, read our Pipeline Documentation.
In [2]:
%%sqlCREATE OR REPLACE PIPELINE stocks_pipelineAS LOAD DATA S3 'helios-self-poc-stockticker/all_stocks_5yr.csv'CONFIG '{"region": "us-east-1"}'CREDENTIALS '{"aws_access_key_id": "<your_key>","aws_secret_access_key": "<your_secret>"}'BATCH_INTERVAL 45000SKIP DUPLICATE KEY ERRORS -- SKIP ALL ERRORS can be used to skip all errors that can be tracked through "Monitor the pipeline for errors"INTO TABLE stocksFIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'LINES TERMINATED BY '\n' STARTING BY ''IGNORE 1 LINESFORMAT CSV(`stocks`.`date`,`stocks`.`open`,`stocks`.`high`,`stocks`.`low`,`stocks`.`close`,`stocks`.`volume`,`stocks`.`Name`);
Start and monitor the pipeline
The CREATE PIPELINE statement creates a new pipeline, but the pipeline has not yet been started, and no data has been loaded. To start a pipeline in the background, run:
In [3]:
%%sqlSTART PIPELINE stocks_pipeline;
In [4]:
%%sqlshow pipelines;
If there is no error or warning, you should see no error message.
In [5]:
%%sqlSELECT * FROM information_schema.pipelines_errorsWHERE pipeline_name = 'stocks_pipeline';
Query the table
In [6]:
%%sqlSELECT * FROM stocks LIMIT 5;
Cleanup resources
In [7]:
%%sqlDROP PIPELINE IF EXISTS test.stocks_pipeline;DROP TABLE IF EXISTS test.stocks;
Details
About this Template
This notebook guides you through data ingestion of CSV files from an AWS S3 location into your shared tier workspace.
This Notebook can be run in Standard and Enterprise deployments.
Tags
License
This Notebook has been released under the Apache 2.0 open source license.