Enabling Coronavirus Research with SingleStore and SafeGraph

Carl Sverre

Senior Director, Engineering

As part of the #singlestore4good initiative, hosted at SingleStore.org, SingleStore is excited to contribute to the SafeGraph COVID-19 Data Consortium. The Consortium provides free access to relevant datasets to researchers, non-profits, and governments. SingleStore has created a repository to help existing customers, and those who wish to get started with SingleStore for free, to use our software in processing coronavirus-related data.


SafeGraph is a company that offers point of interest (POI), business listing, and foot traffic data. They have started the COVID-19 Data Consortium to provide free data for responses to the worldwide coronavirus crisis. SingleStore joins more than 1,000 organizations contributing to the Data Consortium, including the US Centers for Disease Control, the California Governor’s Office, and Johns Hopkins Hospital.

It’s easy to get started using SafeGraph’s COVID-19 datasets today, and to gain the speed, scalability, and SQL capabilities of SingleStore, for free. As CDC director Robert Redfield told the US Senate health committee, “There are a number of counties that are still doing this pen and pencil.” At SingleStore, we encourage data-led approaches to the coronavirus crisis.

Along with joining the consortium, we have created a repository, with scripts and SQL files, to help you quickly get started using existing SafeGraph and American Census data, running on a SingleStore Cluster. Check out the repo here: https://github.com/memsql/memsql-covid-datasets.

While this blog post focuses on the COVID-19 (coronavirus) research effort, it is just one of the worthy causes we are supporting. Qualified organizations interested in an enterprise license, or free Singlestore Helios usage, can reach out to us at SingleStore.

How to Get Started

Here is an example of how quickly you can get started doing your own COVID-19 research, using the American Census dataset included in the repository.

1. Create a Singlestore Helios Trial

The following example requires a running SingleStore cluster. If you are already a SingleStore customer, with either a free license or an Enterprise license, you can use your existing SingleStore instance. If you do not yet have a SingleStore instance, we suggest spinning up a free trial on Singlestore Helios.

2. Clone the Datasets Repository from GitHub to Your Machine

Carry out these commands on your computer to load data into your SingleStore cluster.

# Start by cloning the datasets repo to your machine
git clone https://github.com/memsql/memsql-covid-datasets.git
# Next, you will need to export some environment variables
# to specify connection details for your SingleStore cluster:
export MEMSQL_HOST="*******.db.singlestore.com"
export MEMSQL_PORT="3306"
export MEMSQL_USER="admin"
export MEMSQL_PASS="************"
export MEMSQL_DB="census"
# For convenience we will define a temporary wrapper around the MySQL CLI
alias memsql="mysql -h ${MEMSQL_HOST} -P ${MEMSQL_PORT} -u ${MEMSQL_USER} -p${MEMSQL_PASS}"
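SingleStore speaks the MySQL wire protocol, so any MySQL client or driver can connect with these same settings. As an illustration, here is a small Python sketch (the `connection_params` helper is hypothetical, not part of the repo) that reads the MEMSQL_* environment variables exported above:

```python
import os

def connection_params(env=os.environ):
    """Build MySQL-protocol connection settings from the MEMSQL_* variables.

    Falls back to illustrative defaults when a variable is unset.
    """
    return {
        "host": env.get("MEMSQL_HOST", "127.0.0.1"),
        "port": int(env.get("MEMSQL_PORT", "3306")),
        "user": env.get("MEMSQL_USER", "admin"),
        "password": env.get("MEMSQL_PASS", ""),
        "database": env.get("MEMSQL_DB", "census"),
    }

# With a MySQL-protocol driver such as pymysql (one option among several):
# import pymysql
# conn = pymysql.connect(**connection_params())
```

Any driver that understands the MySQL protocol should work the same way; the shell alias above is simply the quickest route for this walkthrough.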

3. Create the Database and Load the Data

Again, carry out these commands on your machine.

# Create the database
memsql -e "CREATE DATABASE ${MEMSQL_DB}"
# Now you can load the data!
cat memsql-covid-datasets/american_census/schema.sql | memsql ${MEMSQL_DB}
cat memsql-covid-datasets/american_census/pipelines.sql | memsql ${MEMSQL_DB}
# At this point, all of the tables are created and the data is streaming
# in via SingleStore Pipelines. Open the SingleStore console and take a look:
memsql ${MEMSQL_DB}
...
MySQL [census]> show tables;
+--------------------+
| Tables_in_census   |
+--------------------+
| attribute          |
| field              |
| fips_codes         |
| geographic_summary |
| geometry           |
+--------------------+
5 rows in set (0.19 sec)
MySQL [census]> show pipelines;
+---------------------+---------+-----------+
| Pipelines_in_census | State   | Scheduled |
+---------------------+---------+-----------+
| attribute           | Running | True      |
| field               | Running | False     |
| fips_codes          | Running | False     |
| geographic_summary  | Running | False     |
| geometry            | Running | True      |
+---------------------+---------+-----------+
5 rows in set (0.10 sec)
MySQL [census]> select pipeline_name, batch_state, start_time, rows_per_sec from information_schema.pipelines_batches_summary;
+--------------------+-------------+----------------------------+--------------------+
| pipeline_name      | batch_state | start_time                 | rows_per_sec       |
+--------------------+-------------+----------------------------+--------------------+
| attribute          | Queued      | 2020-07-02 00:18:01.825899 |               NULL |
| geometry           | In Progress | 2020-07-02 00:17:32.345491 |  582.9841048751618 |
| fips_codes         | Succeeded   | 2020-07-02 00:17:31.536224 | 105.77593498406135 |
| attribute          | Succeeded   | 2020-07-02 00:17:30.883479 | 4124020.8754720194 |
| field              | Succeeded   | 2020-07-02 00:17:31.219881 | 248.05038971989336 |
| geographic_summary | Succeeded   | 2020-07-02 00:17:31.988135 |  6971.634916464289 |
+--------------------+-------------+----------------------------+--------------------+
6 rows in set (0.28 sec)

Note: Loading all of the data can take some time. You can check the progress of all the pipelines using the following query:

MySQL [census]> select pipeline_name, sum(if (file_state = 'loaded', 1, 0)) / count(*) as progress from information_schema.pipelines_files group by pipeline_name;
+--------------------+----------+
| pipeline_name      | progress |
+--------------------+----------+
| attribute          |   1.0000 |
| field              |   1.0000 |
| geographic_summary |   1.0000 |
| geometry           |   1.0000 |
| fips_codes         |   1.0000 |
+--------------------+----------+
5 rows in set (0.14 sec)
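The progress query above computes, per pipeline, the fraction of its files whose file_state is 'loaded'. The same aggregation restated in plain Python (the rows below are hypothetical samples mirroring information_schema.pipelines_files, not real output):

```python
from collections import defaultdict

def pipeline_progress(rows):
    """Fraction of files loaded per pipeline, given (pipeline_name, file_state) rows."""
    counts = defaultdict(lambda: [0, 0])  # pipeline -> [loaded, total]
    for pipeline_name, file_state in rows:
        counts[pipeline_name][1] += 1
        if file_state == "loaded":
            counts[pipeline_name][0] += 1
    return {name: loaded / total for name, (loaded, total) in counts.items()}

rows = [
    ("attribute", "loaded"),
    ("attribute", "loaded"),
    ("geometry", "loaded"),
    ("geometry", "unloaded"),
]
print(pipeline_progress(rows))  # {'attribute': 1.0, 'geometry': 0.5}
```

A pipeline reporting 1.0000 has ingested every file it knows about; anything lower means the pipeline is still working through its backlog.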

4. Run Initial Queries

As with the previous two steps, run these from your machine, using the memsql alias defined earlier.

Once the data is loaded, let’s run some example queries!

How many healthcare workers are registered in the Census?

MySQL [census]> select sum(value) from field, attribute where field_level_4 like "%Healthcare%" and field_level_6 not like "%margin of error%" and id = field_id;
+------------+
| sum(value) |
+------------+
|   14845235 |
+------------+
1 row in set (0.445 sec)
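The query joins the field metadata table to the attribute values: keep field ids whose field_level_4 mentions "Healthcare" and whose field_level_6 is not a margin-of-error column, then sum the matching values. The same logic sketched in plain Python over hypothetical toy rows (ignoring SQL LIKE’s case-insensitivity):

```python
def healthcare_total(fields, attributes):
    """Sum attribute values whose field matches the healthcare filter."""
    wanted = {
        f["id"]
        for f in fields
        if "Healthcare" in f["field_level_4"]
        and "margin of error" not in f["field_level_6"]
    }
    return sum(a["value"] for a in attributes if a["field_id"] in wanted)

fields = [
    {"id": 1, "field_level_4": "Healthcare Support", "field_level_6": "estimate"},
    {"id": 2, "field_level_4": "Healthcare Support", "field_level_6": "margin of error"},
    {"id": 3, "field_level_4": "Retail Trade", "field_level_6": "estimate"},
]
attributes = [
    {"field_id": 1, "value": 120},
    {"field_id": 2, "value": 8},
    {"field_id": 3, "value": 40},
]
print(healthcare_total(fields, attributes))  # 120
```

Excluding the margin-of-error columns matters: the Census ships each estimate alongside its margin of error, and summing both would double-count rows.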

Which counties have the largest number of registered healthcare workers?

MySQL [census]> select state, county, sum(value) from field, attribute, geometry where field_level_4 like "%Healthcare%" and field_level_6 not like "%margin of error%" and id = field_id and attribute.census_block_group = geometry.census_block_group group by 1, 2 order by 3 desc limit 10;
+-------+--------------------+------------+
| state | county             | sum(value) |
+-------+--------------------+------------+
| CA    | Los Angeles County |     379064 |
| IL    | Cook County        |     232916 |
| AZ    | Maricopa County    |     183230 |
| TX    | Harris County      |     172601 |
| CA    | San Diego County   |     136268 |
| CA    | Orange County      |     128229 |
| FL    | Miami-Dade County  |     111558 |
| NY    | Queens County      |     102049 |
| NY    | Kings County       |     100497 |
| WA    | King County        |      95217 |
+-------+--------------------+------------+
10 rows in set (0.768 sec)
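The county ranking is the same join plus a GROUP BY on state and county, ordered by the summed value. As a sketch, the equivalent shape in plain Python, over hypothetical (state, county, value) rows that survive the join and filters:

```python
from collections import Counter

def top_counties(rows, n=3):
    """Sum values per (state, county) and return the n largest totals."""
    totals = Counter()
    for state, county, value in rows:
        totals[(state, county)] += value
    return totals.most_common(n)

rows = [
    ("CA", "Los Angeles County", 200000),
    ("CA", "Los Angeles County", 179064),
    ("IL", "Cook County", 232916),
    ("AZ", "Maricopa County", 183230),
]
print(top_counties(rows))
# [(('CA', 'Los Angeles County'), 379064),
#  (('IL', 'Cook County'), 232916),
#  (('AZ', 'Maricopa County'), 183230)]
```

In SingleStore, of course, the GROUP BY runs distributed across the cluster, which is why the full query over the whole Census dataset still returns in under a second.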

Taking the Next Step

Hopefully, that gives you a taste of what you can do with our new SingleStore COVID-19 Datasets repo! Now that you’ve gone through the process of loading the US Census data, consider joining the SafeGraph COVID-19 Data Consortium, to get access to much more data, for free. We look forward to hearing about your progress; email us at covidresearch@singlestore.com.
