
Employee Data Analysis JSON Dataset
Notebook

Note
This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.
In this example, we create a pipeline from multiple JSON files stored in an AWS S3 bucket called singlestoreloaddata, under a folder called employeedata. This bucket is located in ap-south-1.
Each file has the following shape with nested arrays:
{
    "userId": "88-052-8576",
    "jobTitleName": "Social Worker",
    "firstName": "Mavis",
    "lastName": "Hilldrop",
    "dataofjoining": "20/09/2020",
    "contactinfo": {
        "city": "Dallas",
        "phone": "972-454-9822",
        "emailAddress": "mhilldrop0@google.ca",
        "state": "TX",
        "zipcode": "75241"
    },
    "Children": [
        "Evaleen",
        "Coletta",
        "Leonelle"
    ],
    "salary": 203000
}
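Before building the pipeline, it can help to sanity-check this shape locally. The sketch below is plain Python over the sample record above, confirming that the nested object and array parse as expected:

```python
import json

# The sample record shown above, as it would arrive from one of the S3 files
raw = '''
{
    "userId": "88-052-8576",
    "jobTitleName": "Social Worker",
    "firstName": "Mavis",
    "lastName": "Hilldrop",
    "dataofjoining": "20/09/2020",
    "contactinfo": {
        "city": "Dallas",
        "phone": "972-454-9822",
        "emailAddress": "mhilldrop0@google.ca",
        "state": "TX",
        "zipcode": "75241"
    },
    "Children": ["Evaleen", "Coletta", "Leonelle"],
    "salary": 203000
}
'''

record = json.loads(raw)
# contactinfo is a nested object; Children is a nested array
print(record["contactinfo"]["state"])  # TX
print(len(record["Children"]))         # 3
```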
Create a database (you can skip this step if you are using the Free Starter Workspace)
We need to create a database to work with in the following examples.
In [1]:
shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS HRData;
    %sql CREATE DATABASE HRData;
Action Required
If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. It updates the connection_url so that it connects to that database.
In [2]:
%%sql
CREATE TABLE IF NOT EXISTS employeeData /* Creating table for sample data */ (
    userId text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    jobTitleName text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    firstName text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    lastName text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    dataofjoining text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    contactinfo JSON COLLATE utf8_bin NOT NULL,
    salary int NOT NULL,
    Children JSON COLLATE utf8_bin NOT NULL
);
Create Pipeline To Insert JSON Data into Table
In [3]:
%%sql
CREATE PIPELINE IF NOT EXISTS employeeData AS
LOAD DATA S3 'singlestoreloaddata/employeedata/*.json' /* Creating pipeline for sample data */
CONFIG '{ \"region\": \"ap-south-1\" }'
/*
CREDENTIALS '{"aws_access_key_id": "<Key to Enter>",
"aws_secret_access_key": "<Key to Enter>"}'
*/
INTO TABLE employeeData
FORMAT JSON
(
    userId <- userId,
    jobTitleName <- jobTitleName,
    firstName <- firstName,
    lastName <- lastName,
    dataofjoining <- dataofjoining,
    contactinfo <- contactinfo,
    salary <- salary,
    Children <- Children
);

START PIPELINE employeeData;
Check if Data is Loaded
In [4]:
%%sql
SELECT * FROM employeeData LIMIT 5;
Sample Queries
Select the Top 2 Employees with the Highest Salary Residing in State 'MS'
In [5]:
%%sql
SELECT * FROM employeeData
WHERE contactinfo::$state = 'MS'
ORDER BY salary DESC
LIMIT 2;
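The `::$` operator extracts a JSON field as text. In plain Python terms, with a few made-up rows standing in for the table, the query above amounts to a filter, sort, and slice:

```python
# Hypothetical rows standing in for employeeData; the real data is loaded by the pipeline
rows = [
    {"firstName": "Ada",  "contactinfo": {"state": "MS"}, "salary": 180000},
    {"firstName": "Ben",  "contactinfo": {"state": "TX"}, "salary": 210000},
    {"firstName": "Cara", "contactinfo": {"state": "MS"}, "salary": 150000},
    {"firstName": "Dee",  "contactinfo": {"state": "MS"}, "salary": 195000},
]

# WHERE contactinfo::$state = 'MS' ORDER BY salary DESC LIMIT 2
top2 = sorted(
    (r for r in rows if r["contactinfo"]["state"] == "MS"),
    key=lambda r: r["salary"],
    reverse=True,
)[:2]

print([r["firstName"] for r in top2])  # ['Dee', 'Ada']
```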
Select the Top 5 Cities with the Highest Average Salary
In [6]:
%%sql
SELECT contactinfo::$city AS City, AVG(salary) AS 'Avg Salary'
FROM employeeData
GROUP BY contactinfo::$city
ORDER BY AVG(salary) DESC
LIMIT 5;
Number of Employees Grouped by Number of Children
In [7]:
%%sql
SELECT
    JSON_LENGTH(Children) AS No_of_Kids,
    COUNT(*) AS employees_with_children
FROM employeeData
GROUP BY JSON_LENGTH(Children);
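`JSON_LENGTH` returns the number of elements in a JSON array, so this query buckets employees by child count. A minimal Python sketch of the same grouping, over hypothetical Children arrays:

```python
from collections import Counter

# Hypothetical Children arrays; the real values are loaded by the pipeline
children_lists = [
    ["Evaleen", "Coletta", "Leonelle"],
    [],
    ["Sam"],
    ["Ira", "Joon"],
    [],
]

# GROUP BY JSON_LENGTH(Children) with COUNT(*)
by_kid_count = Counter(len(kids) for kids in children_lists)
print(dict(by_kid_count))  # {3: 1, 0: 2, 1: 1, 2: 1}
```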
Average salary of employees who have children
In [8]:
%%sql
SELECT
    AVG(salary) AS average_salary_with_children
FROM employeeData
WHERE JSON_LENGTH(Children) > 0;
Select the total and average salary by State
In [9]:
%%sql
SELECT
    contactinfo::$state AS State,
    COUNT(*) AS 'No of Employees',
    SUM(salary) AS 'Total Salary',
    AVG(salary) AS 'Average Salary'
FROM employeeData
GROUP BY contactinfo::$state
LIMIT 5;
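The same per-state rollup (count, SUM, AVG) can be sketched in plain Python, again with a few made-up rows in place of the table:

```python
from collections import defaultdict

# Hypothetical (state, salary) pairs standing in for employeeData rows
rows = [("TX", 203000), ("TX", 101000), ("MS", 150000)]

# GROUP BY state, accumulating COUNT(*) and SUM(salary)
totals = defaultdict(lambda: {"count": 0, "total": 0})
for state, salary in rows:
    totals[state]["count"] += 1
    totals[state]["total"] += salary

# AVG(salary) is SUM / COUNT per group
for state, agg in totals.items():
    avg = agg["total"] / agg["count"]
    print(state, agg["count"], agg["total"], avg)
```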
Top 5 Job Titles with the Highest Number of Employees
In [10]:
%%sql
SELECT
    jobTitleName,
    COUNT(*) AS num_employees
FROM employeeData
GROUP BY jobTitleName
ORDER BY num_employees DESC
LIMIT 5;
Select the highest and lowest salary
In [11]:
%%sql
SELECT
    MAX(salary) AS highest_salary,
    MIN(salary) AS lowest_salary
FROM employeeData;
Conclusion
We have shown how to connect to S3 using Pipelines and insert JSON data into SingleStoreDB. These techniques should enable you to integrate and query your JSON data with SingleStoreDB.
Clean up
Uncomment (remove the '#') and execute the queries below to clean up the pipeline and the table created above.
In [12]:
%%sql
#STOP PIPELINE employeeData;

#DROP PIPELINE employeeData;
Drop data
In [13]:
#shared_tier_check = %sql show variables like 'is_shared_tier'
#if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
#    %sql DROP DATABASE IF EXISTS HRData;
#else:
#    %sql DROP TABLE employeeData;

Details
About this Template
The Employee Data Analysis use case illustrates how to leverage SingleStore's capabilities to process and analyze JSON data from an Amazon S3 data source.
This Notebook can be run in Shared Tier, Standard and Enterprise deployments.
License
This Notebook has been released under the Apache 2.0 open source license.