New

Employee Data Analysis JSON Dataset

Notebook


SingleStore Notebooks

Employee Data Analysis JSON Dataset

Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.

In this example, we want to create a pipeline from multiple JSON files stored in an AWS S3 bucket called singlestoredb and a folder called employeedata. This bucket is located in ap-south-1.

Each file has the following shape with nested arrays:

{
  "userId": "88-052-8576",
  "jobTitleName": "Social Worker",
  "firstName": "Mavis",
  "lastName": "Hilldrop",
  "dataofjoining": "20/09/2020",
  "contactinfo": {
    "city": "Dallas",
    "phone": "972-454-9822",
    "emailAddress": "mhilldrop0@google.ca",
    "state": "TX",
    "zipcode": "75241"
  },
  "Children": [
    "Evaleen",
    "Coletta",
    "Leonelle"
  ],
  "salary": 203000
}

Demo Flow

How to use this notebook

Create a database (You can skip this Step if you are using Free Starter Tier)

We need to create a database to work with in the following examples.

In [1]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
%sql DROP DATABASE IF EXISTS HRData;
%sql CREATE DATABASE HRData;

Action Required

If you have a Free Starter Workspace deployed already, select the database from drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

In [2]:

%%sql
CREATE TABLE IF NOT EXISTS employeeData /* Creating table for sample data */ (
userId text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
jobTitleName text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
firstName text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
lastName text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
dataofjoining text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
contactinfo JSON COLLATE utf8_bin NOT NULL,
salary int NOT NULL,
Children JSON COLLATE utf8_bin NOT NULL
);

Create Pipeline To Insert JSON Data into Table

In [3]:

%%sql
CREATE PIPELINE IF NOT EXISTS employeeData AS
LOAD DATA S3 'singlestoreloaddata/employeedata/*.json' /* Creating pipeline for sample data */
CONFIG '{ \"region\": \"ap-south-1\" }'
/*
CREDENTIALS '{"aws_access_key_id": "<Key to Enter>",
"aws_secret_access_key": "<Key to Enter>"}'
*/
INTO TABLE employeeData
FORMAT JSON
(
userId <- userId,
jobTitleName <- jobTitleName,
firstName <- firstName,
lastName <- lastName,
dataofjoining <- dataofjoining,
contactinfo <- contactinfo,
salary <- salary,
Children <- Children
);
START PIPELINE employeeData;

Check if Data is Loaded

In [4]:

%%sql
SELECT * from employeeData limit 5;

Sample Queries

Select Top 2 Employees with highest salary risiding in State 'MS'

In [5]:

%%sql
select * from employeeData where contactinfo::$state = 'MS' order by salary desc limit 2

Select Top 5 Cities with highest Average salary

In [6]:

%%sql
select contactinfo::$city as City,AVG(salary) as 'Avg Salary' from employeeData
group by contactinfo::$city order by AVG(salary) desc limit 5

Number of employees with Children grouped by No of children

In [7]:

%%sql
SELECT
JSON_LENGTH(Children) as No_of_Kids,
COUNT(*) AS employees_with_children
FROM employeeData
group by JSON_LENGTH(Children);

Average salary of employees who have children

In [8]:

%%sql
SELECT
AVG(salary) AS average_salary_with_children
FROM employeeData
WHERE JSON_LENGTH(Children) > 0;

Select the total and average salary by State

In [9]:

%%sql
SELECT
contactinfo::$state AS State,
COUNT(*) AS 'No of Employees',
SUM(salary) AS 'Total Salary',
AVG(salary) AS 'Average Salary'
FROM employeeData
GROUP BY contactinfo::$state limit 5;

Top 5 job title with highest number of employees

In [10]:

%%sql
SELECT
jobTitleName,
COUNT(*) AS num_employees
FROM employeeData
GROUP BY jobTitleName order by num_employees desc limit 5;

Select the highest and lowest salary

In [11]:

%%sql
SELECT
MAX(salary) AS highest_salary,
MIN(salary) AS lowest_salary
FROM employeeData;

Conclusion

We have shown how to connect to S3 using Pipelines and insert JSON data into SinglestoreDB. These techniques should enable you to integrate and query your JSON data with SingleStoreDB.

Clean up

Remove the '#' to uncomment and execute the queries below to clean up the pipeline and table created.

In [12]:

%%sql
#STOP PIPELINE employeeData;
#DROP PIPELINE employeeData;

Drop data

In [13]:

#shared_tier_check = %sql show variables like 'is_shared_tier'
#if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
# %sql DROP DATABASE IF EXISTS HRData;
#else:
# %sql DROP TABLE employeeData;

Details


About this Template

Employee Data Analysis use case illustrates how to leverage Singlestore's capabilities to process and analyze JSON data from a Amazon S3 data source.

Notebook Icon

This Notebook can be run in Shared Tier, Standard and Enterprise deployments.

Tags

starterloaddatajson

License

This Notebook has been released under the Apache 2.0 open source license.