Employee Data Analysis JSON Dataset
Notebook
Note
This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.
In this example, we want to create a pipeline from multiple JSON files stored in an AWS S3 bucket called singlestoredb and a folder called employeedata. This bucket is located in ap-south-1.
Each file has the following shape with nested arrays:
{
"userId": "88-052-8576",
"jobTitleName": "Social Worker",
"firstName": "Mavis",
"lastName": "Hilldrop",
"dataofjoining": "20/09/2020",
"contactinfo": {
"city": "Dallas",
"phone": "972-454-9822",
"emailAddress": "mhilldrop0@google.ca",
"state": "TX",
"zipcode": "75241"
},
"Children": [
"Evaleen",
"Coletta",
"Leonelle"
],
"salary": 203000
}
Demo Flow
How to use this notebook
Create a database (You can skip this Step if you are using Free Starter Tier)
We need to create a database to work with in the following examples.
In [1]:
shared_tier_check = %sql show variables like 'is_shared_tier'if not shared_tier_check or shared_tier_check[0][1] == 'OFF':%sql DROP DATABASE IF EXISTS HRData;%sql CREATE DATABASE HRData;
Action Required
If you have a Free Starter Workspace deployed already, select the database from drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
In [2]:
%%sqlCREATE TABLE IF NOT EXISTS employeeData (userId text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,jobTitleName text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,firstName text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,lastName text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,dataofjoining text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,contactinfo JSON COLLATE utf8_bin NOT NULL,salary int NOT NULL,Children JSON COLLATE utf8_bin NOT NULL);
Create Pipeline To Insert JSON Data into Table
In [3]:
%%sqlCREATE PIPELINE employeeData ASLOAD DATA S3 'singlestoreloaddata/employeedata/*.json'CONFIG '{ \"region\": \"ap-south-1\" }'/*CREDENTIALS '{"aws_access_key_id": "<Key to Enter>","aws_secret_access_key": "<Key to Enter>"}'*/INTO TABLE employeeDataFORMAT JSON(userId <- userId,jobTitleName <- jobTitleName,firstName <- firstName,lastName <- lastName,dataofjoining <- dataofjoining,contactinfo <- contactinfo,salary <- salary,Children <- Children);START PIPELINE employeeData;
Check if Data is Loaded
In [4]:
%%sqlSELECT * from employeeData limit 5;
Sample Queries
Select Top 2 Employees with highest salary risiding in State 'MS'
In [5]:
%%sqlselect * from employeeData where contactinfo::$state = 'MS' order by salary desc limit 2
Select Top 5 Cities with highest Average salary
In [6]:
%%sqlselect contactinfo::$city as City,AVG(salary) as 'Avg Salary' from employeeDatagroup by contactinfo::$city order by AVG(salary) desc limit 5
Number of employees with Children grouped by No of children
In [7]:
%%sqlSELECTJSON_LENGTH(Children) as No_of_Kids,COUNT(*) AS employees_with_childrenFROM employeeDatagroup by JSON_LENGTH(Children);
Average salary of employees who have children
In [8]:
%%sqlSELECTAVG(salary) AS average_salary_with_childrenFROM employeeDataWHERE JSON_LENGTH(Children) > 0;
Select the total and average salary by State
In [9]:
%%sqlSELECTcontactinfo::$state AS State,COUNT(*) AS 'No of Employees',SUM(salary) AS 'Total Salary',AVG(salary) AS 'Average Salary'FROM employeeDataGROUP BY contactinfo::$state limit 5;
Top 5 job title with highest number of employees
In [10]:
%%sqlSELECTjobTitleName,COUNT(*) AS num_employeesFROM employeeDataGROUP BY jobTitleName order by num_employees desc limit 5;
Select the highest and lowest salary
In [11]:
%%sqlSELECTMAX(salary) AS highest_salary,MIN(salary) AS lowest_salaryFROM employeeData;
Conclusion
We have shown how to connect to S3 using Pipelines
and insert JSON data into SinglestoreDB. These techniques should enable you to
integrate and query your JSON data with SingleStoreDB.
Clean up
Drop the pipeline using below command
In [12]:
%%sqlSTOP PIPELINE employeeData;DROP PIPELINE employeeData;
Drop data
In [13]:
shared_tier_check = %sql show variables like 'is_shared_tier'if not shared_tier_check or shared_tier_check[0][1] == 'OFF':%sql DROP DATABASE IF EXISTS HRData;else:%sql DROP TABLE employeeData;
Details
About this Template
Employee Data Analysis use case illustrates how to leverage Singlestore's capabilities to process and analyze JSON data from a Amazon S3 data source.
This Notebook can be run in Shared Tier, Standard and Enterprise deployments.
Tags
License
This Notebook has been released under the Apache 2.0 open source license.