
Load JSON files with Pipeline from S3

Note
This tutorial is meant for Standard & Premium Workspaces. You can't run this with a Free Starter Workspace due to restrictions on Storage. Create a Workspace using + group in the left nav & select Standard for this notebook. Gallery notebooks tagged with "Starter" are suitable to run on a Free Starter Workspace.
This notebook walks you through different data ingestion scenarios for JSON files stored in an AWS S3 location:
Ingest JSON files in AWS S3 using wildcards with pre-defined schema
Ingest JSON files in AWS S3 using wildcards into a JSON column
Create a Pipeline from JSON files in AWS S3 using wildcards
In this example, we want to create a pipeline from two JSON files called actors1.json and actors2.json, stored in a folder called actors in an AWS S3 bucket called singlestoredb. This bucket is located in us-east-1.
Each file has the following shape with nested objects and arrays:
{
"Actors": [
{
"name": "Tom Cruise",
"age": 56,
"Born At": "Syracuse, NY",
"Birthdate": "July 3, 1962",
"photo": "https://jsonformatter.org/img/tom-cruise.jpg",
"wife": null,
"weight": 67.5,
"hasChildren": true,
"hasGreyHair": false,
"children": [
"Suri",
"Isabella Jane",
"Connor"
]
},
{
"name": "Robert Downey Jr.",
"age": 53,
"Born At": "New York City, NY",
"Birthdate": "April 4, 1965",
"photo": "https://jsonformatter.org/img/Robert-Downey-Jr.jpg",
"wife": "Susan Downey",
"weight": 77.1,
"hasChildren": true,
"hasGreyHair": false,
"children": [
"Indio Falconer",
"Avri Roel",
"Exton Elias"
]
}
]
}
Create a Table
We first create a table called actors in the database demo_database.
In [1]:
%%sql
CREATE DATABASE IF NOT EXISTS demo_database;
Action Required
Make sure to select the demo_database database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
In [2]:
%%sql
CREATE TABLE IF NOT EXISTS demo_database.actors (
    name text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    age int NOT NULL,
    born_at text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    Birthdate text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    photo text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    wife text CHARACTER SET utf8 COLLATE utf8_general_ci,
    weight float NOT NULL,
    haschildren boolean,
    hasGreyHair boolean,
    children JSON COLLATE utf8_bin NOT NULL,
    SHARD KEY ()
);
Create a pipeline
We then create a pipeline called actors in the database demo_database. Since these files are small, the batch_interval matters little and the maximum partitions per batch is set to only 1. For faster ingestion on larger datasets, we recommend increasing the maximum partitions per batch (see the tuning sketch after the START PIPELINE cell below). Note that since the bucket is publicly accessible, you do not need to provide an access key and secret.
In [3]:
%%sql
CREATE PIPELINE IF NOT EXISTS demo_database.actors
    AS LOAD DATA S3 'studiotutorials/sample_dataset/json_files/wildcard_demo/*.json'
    CONFIG '{ \"region\": \"us-east-1\" }'
    /*
    CREDENTIALS '{"aws_access_key_id": "<Key to Enter>",
                  "aws_secret_access_key": "<Key to Enter>"}'
    */
    BATCH_INTERVAL 2500
    MAX_PARTITIONS_PER_BATCH 1
    DISABLE OUT_OF_ORDER OPTIMIZATION
    DISABLE OFFSETS METADATA GC
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE `actors`
    FORMAT JSON
    (
        actors.name <- name,
        actors.age <- age,
        actors.born_at <- `Born At`,
        actors.Birthdate <- Birthdate,
        actors.photo <- photo,
        actors.wife <- wife,
        actors.weight <- weight,
        actors.haschildren <- hasChildren,
        actors.hasGreyHair <- hasGreyHair,
        actors.children <- children
    );
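Before starting the pipeline, you can optionally preview what it would extract. This step is not part of the original tutorial; it is a minimal check using SingleStore's TEST PIPELINE command, assuming the pipeline above was created successfully:
%%sql
# Preview one extracted record without writing anything to the table
TEST PIPELINE demo_database.actors LIMIT 1;
TEST PIPELINE runs the extraction in the foreground without loading data, so it is a quick way to validate the S3 path and the column mapping before you start the pipeline.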
Start and monitor the pipeline
In [4]:
%%sql
START PIPELINE demo_database.actors;
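As noted above, the small sample files load fine with a single partition per batch. If you later point the pipeline at a larger dataset, you can raise that limit. This is a hedged sketch rather than a required step: a pipeline must be stopped before it can be altered, and the value 8 is only an illustrative choice.
%%sql
STOP PIPELINE demo_database.actors;
# Increase parallelism per batch (8 is an arbitrary example value)
ALTER PIPELINE demo_database.actors SET MAX_PARTITIONS_PER_BATCH 8;
START PIPELINE demo_database.actors;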
If the pipeline runs without errors or warnings, the following query returns no rows.
In [5]:
%%sql
SELECT * FROM information_schema.pipelines_errors
    WHERE pipeline_name = 'actors';
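Besides checking for errors, you can also confirm the pipeline is in the Running state and inspect recent batches. Neither query is part of the original notebook; they are optional checks, assuming the pipelines_batches_summary view is available in your version:
%%sql
SHOW PIPELINES;
SELECT * FROM information_schema.pipelines_batches_summary
    WHERE pipeline_name = 'actors';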
Query the table
In [6]:
%%sql
SELECT * FROM demo_database.actors;
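Because children was ingested into a JSON column, you can query inside it directly. As a small illustrative example (not part of the original notebook), the following counts each actor's children with JSON_LENGTH:
%%sql
SELECT name, JSON_LENGTH(children) AS num_children
    FROM demo_database.actors
    ORDER BY num_children DESC;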
Clean up resources
In [7]:
%%sql
DROP PIPELINE IF EXISTS demo_database.actors;
DROP TABLE IF EXISTS demo_database.actors;
Ingest JSON files in AWS S3 using wildcards into a JSON column
Because the schema of your files might change, you might want the flexibility of ingesting the data into a single JSON column, which we name json_data. The table we create is named actors_json.
Create Table
In [8]:
%%sql
CREATE TABLE IF NOT EXISTS demo_database.actors_json (
    json_data JSON NOT NULL,
    SHARD KEY ()
);
Create a pipeline
In [9]:
%%sql
CREATE PIPELINE IF NOT EXISTS demo_database.actors_json
    AS LOAD DATA S3 'studiotutorials/sample_dataset/json_files/wildcard_demo/*.json'
    CONFIG '{ \"region\": \"us-east-1\" }'
    /*
    CREDENTIALS '{"aws_access_key_id": "<Key to Enter>",
                  "aws_secret_access_key": "<Key to Enter>"}'
    */
    BATCH_INTERVAL 2500
    MAX_PARTITIONS_PER_BATCH 1
    DISABLE OUT_OF_ORDER OPTIMIZATION
    DISABLE OFFSETS METADATA GC
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE `actors_json`
    FORMAT JSON
    (json_data <- %);
Start and monitor pipeline
In [10]:
%%sql
START PIPELINE demo_database.actors_json;
In [11]:
%%sql
# Monitor and see if there is any error or warning
SELECT * FROM information_schema.pipelines_errors
    WHERE pipeline_name = 'actors_json';
Query the table
In [12]:
%%sql
SELECT * FROM demo_database.actors_json;
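To pull individual fields back out of json_data, you can use SingleStore's JSON extraction functions. This is a hedged sketch that assumes each ingested row holds a single actor object like the ones shown at the top of this notebook; if a whole file lands as one document, the key paths would need to go through the top-level Actors array instead:
%%sql
SELECT JSON_EXTRACT_STRING(json_data, 'name')    AS name,
       JSON_EXTRACT_BIGINT(json_data, 'age')     AS age,
       JSON_EXTRACT_STRING(json_data, 'Born At') AS born_at
    FROM demo_database.actors_json;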
Clean up resources
In [13]:
%%sql
DROP DATABASE IF EXISTS demo_database;

Details
About this Template
This notebook will help you load JSON files from a public open AWS S3 bucket.
This Notebook can be run in Standard and Enterprise deployments.
License
This Notebook has been released under the Apache 2.0 open source license.