Getting Started with Notebooks

Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace, navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this notebook.

What you will learn in this notebook:

  1. Load a CSV file from our GitHub repo [Python]

  2. Ingest that file into SingleStoreDB without defining the schema [Python]

  3. Interact natively with the database using SQL [SQL]

  4. Convert results to a DataFrame and visualize results with Plotly [Python]

Questions?

Reach out to us through our forum.

Enhance your notebooks with visualizations

1. Import libraries for reading data into a DataFrame

Our data set contains geographic data, so we also install Shapely to store that data in Shapely geometry objects.

In [1]:

!pip3 install shapely --quiet
import pandas as pd
import shapely.wkt

2. Load a CSV file hosted on GitHub using Python

Notice that we are using the dtype=, parse_dates=, and converters= options of the read_csv method to convert specific columns into various data types, including geographic data in the business_location column. See the read_csv documentation for more information.

In [2]:

url = 'https://raw.githubusercontent.com/singlestore-labs/singlestoredb-samples/main/' + \
      'Sample%20datasets/csv/Restaurant_Scores_LIVES_Standard.csv'

In [3]:

def str_to_shapely(x: str) -> shapely.geometry.Point | None:
    """Convert a WKT string to a shapely object while handling NULLs."""
    return shapely.wkt.loads(x) if x else None


# Read URL directly using pd.read_csv
df = pd.read_csv(url, index_col=0,
                 # Use parse_dates=, dtype=, and converters= to specify explicit data types
                 parse_dates=['inspection_date'],
                 date_format='%m/%d/%Y %H:%M:%S %p',
                 dtype=dict(business_id=int, business_phone_number=str,
                            business_postal_code=str, inspection_score=float),
                 converters=dict(business_location=str_to_shapely))
df

Display the data types in the resulting DataFrame. Note that any objects that pandas does not support natively (e.g., strings, blobs, shapely geometries) show up as object.

In [4]:

df.dtypes
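
As a quick check (a minimal sketch, assuming the business_location column shown above), you can confirm that the converter produced shapely objects rather than plain strings:

# Inspect the Python type of the first non-null business_location value;
# it should be a shapely geometry rather than a str
print(type(df['business_location'].dropna().iloc[0]))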

Action Required

If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

3. Ingest a DataFrame into a SingleStoreDB table

  1. Create the database

  2. Import the library to connect to the database

  3. Create the connection to the database

  4. Ingest the DataFrame into the newly created database

Set the database name in a variable. It will be used in subsequent queries.

Here we are using the database_name variable in SQL magic commands. The syntax for including a Python variable is to surround the variable name with {{ ... }}.

In [5]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    database_name = 'getting_started_notebook'
    %sql DROP DATABASE IF EXISTS {{database_name}};
    %sql CREATE DATABASE {{database_name}};
else:
    current_database = %sql SELECT DATABASE() as CurrentDatabase
    database_name = current_database[0][0]
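
As a quick sanity check (a minimal sketch; the target_database alias is just illustrative), you can echo the chosen name back through the same {{ ... }} substitution used above:

# Echo the selected database name through the {{ ... }} template substitution
check = %sql SELECT '{{database_name}}' AS target_database
check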

Action Required

Make sure to select a database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

We can use SQLAlchemy and pandas to upload a DataFrame. Note that if the table does not exist, the data types will be inferred from the data, which may not give you the exact types you want. You can define the table in the database before uploading to get the exact types you want (see the sketch after the upload cell below).

If you get an error about the database not being selected, that simply means that your connection_url does not contain a specific database to connect to. You can use the drop-down menu at the top of this notebook (immediately under the title) to select a database to work with. Changing the selection in the drop-down menu also updates the connection_url variable.

In [6]:

import sqlalchemy as sa
# Create a SQLAlchemy engine and connect
db_connection = sa.create_engine(connection_url).connect()

The SingleStoreDB Python package also adds a convenience function for SQLAlchemy connections without using the connection_url. It automatically gets the connection information from the SINGLESTOREDB_URL environment variable.
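
If you want to verify that this environment variable is available before creating the engine (a minimal, purely illustrative check), you can look for it directly:

import os
# create_engine() below reads its connection info from this environment variable
print('SINGLESTOREDB_URL set:', 'SINGLESTOREDB_URL' in os.environ)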

In [7]:

import singlestoredb as s2
# Create a SQLAlchemy engine and connect, without having to specify the connection URL
db_connection = s2.create_engine().connect()
# Upload the DataFrame
df.to_sql('sf_restaurant_scores', con=db_connection, if_exists='append', chunksize=1000)
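
As noted above, if you want explicit column types rather than inferred ones, you can create the table before uploading. Here is a minimal sketch that reuses the same SQLAlchemy connection; the table name and the column list are illustrative (they cover only part of this dataset), so adjust them to match your DataFrame before using to_sql with if_exists='append'.

# Hypothetical pre-created table with explicit types (partial column list);
# to_sql(..., if_exists='append') would then insert into these known columns
db_connection.execute(sa.text(f'''
    CREATE TABLE IF NOT EXISTS {database_name}.sf_restaurant_scores_typed (
        business_id BIGINT,
        business_name TEXT,
        inspection_date DATETIME,
        inspection_score DOUBLE
    )
'''))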

4. Interact natively with the database using SQL

  1. Read the top 10 rows from the table

  2. Alter the table to store the date as a date type rather than a string

  3. Read the number of restaurant inspections over time in San Francisco

In [8]:

%%sql
SELECT * FROM {{database_name}}.sf_restaurant_scores LIMIT 10;

In the code block below, we use the result1 << syntax on the %%sql line to store the result of the SQL operation in a variable that can be used later. As with other Jupyter notebooks, you can always get the value of the last executed cell from the _ (underscore) variable, but setting a specific variable name is generally a safer way to retrieve results.

In [9]:

%%sql result1 <<
SELECT
    DATE_TRUNC('month', inspection_date) AS month,
    COUNT(*) AS count_inspection
FROM
    {{database_name}}.sf_restaurant_scores
GROUP BY
    month
ORDER BY
    month DESC;

The output of a %%sql cell is a ResultSet which contains methods for converting to various other data types (e.g., csv, dicts, DataFrame, PolarsDataFrame). It is also possible to convert to a DataFrame by passing a ResultSet object to the DataFrame constructor as we'll see below.

In [10]:

type(result1)
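
For example, based on the conversion methods listed above, the ResultSet can produce a DataFrame directly (a minimal sketch; the next section uses the DataFrame constructor instead):

# Convert the ResultSet to a pandas DataFrame using its own DataFrame() method
result1.DataFrame()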

5. Visualize with Plotly

We are using Plotly to visualize the data in result1. The first parameter of the bar function requires a DataFrame, so we'll convert result1 to a DataFrame before calling bar.

In [11]:

result1_df = pd.DataFrame(result1)
result1_df[:5]

In [12]:

import plotly.express as px
px.bar(result1_df, x='month', y='count_inspection', title='Inspections by Month')
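
If you want to adjust the chart (a minimal sketch, purely illustrative), the figure object returned by px.bar can be customized before display:

# Relabel the axes on the same data and render the customized figure
fig = px.bar(result1_df, x='month', y='count_inspection', title='Inspections by Month')
fig.update_layout(xaxis_title='Inspection month', yaxis_title='Number of inspections')
fig.show()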

6. Cleanup database

Action Required

If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.

In [13]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS {{database_name}};

Tags

#starter #python #sql

License

This Notebook has been released under the Apache 2.0 open source license.