Getting Started with DataFrames in SingleStoreDB
Notebook
Note
This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.
SingleStoreDB supports a backend for the Ibis to
create a pandas DataFrame
-like experience while use the power of a distributed database on the
backend. When interacting with Ibis objects, expressions are built of the operations rather
than executing them instantly. Once you reach the point where you want to see the results of
the operations, you call the .execute()
method which constructs a SQL query and submits it
to the SingleStoreDB server. The results then come back as a local pandas DataFrame
or Series
.
The entire workflow gives the feel of working with pandas DataFrames
while giving you the
ability to work with much larger data sets.
In the SingleStore Cloud environment, all of the needed Python packages are already installed. To run this notebook outside of that environment, you will need to install:
pip install 'singlestoredb[dataframe]'
The above will install the SingleStoreDB Python client, the SingleStoreDB SQLAlchemy dialect, and the SingleStoreDB Ibis backend.
In [1]:
import ibisimport pandas as pd
We'll set up a few options to make the notebook nicer to use, including the display width of DataFrames and enabling the interactive features
of the Ibis package. The primary purpose of the interactive
option is so that you don't have to manually run the execute
method on expressions
to see the results. In our notebook, the execute
method will be called implicitly each time an Ibis expression is rendered by the notebook.
In [2]:
pd.options.display.max_colwidth = 120ibis.options.verbose = Falseibis.options.interactive = True
Action Required
If you have a Free Starter Workspace deployed already, select the database from drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
Create a database
If you have a Standard or Premium workspace deployment, you can create a new database to run this notebook. Running the cell below to create the new database in your existing workspace. Note: this will NOT create a new database in your Free Starter Workspace.
In [3]:
shared_tier_check = %sql show variables like 'is_shared_tier'if not shared_tier_check or shared_tier_check[0][1] == 'OFF':%sql DROP DATABASE IF EXISTS ibis_getting_started;%sql CREATE DATABASE ibis_getting_started;
Action Required
Make sure to select a database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
Make a connection using Ibis
Connections to Ibis are made through the ibis.singlestoredb
method. The connection parameters can be given using either the suggested Python DB-API
parameters (i.e., host=, user=, port=, etc.) or as a SQLAlchemy URL (e.g., user:password@host:port/database). In this case, we are getting the username and password from the environment variable SINGLESTOREDB_URL
which is supplied by the
SingleStore notebook environment.
In [4]:
conn = ibis.singlestoredb.connect()
Phone review data example
Our first example will use customer review data for phones. It includes information such as a reviewer name, the text of the review, a short summary of the review, a numeric rating from 1 to 5, and the date of the review. We'll first read the data into a local pandas DataFrame to do some cleanup, then upload it to the database to demonstrate the SingleStoreDB package capabilities
Read sample data into pandas DataFrame
We will read the data for this example directly from a Github repository using a URL in the pd.read_csv
function.
The result will be a pandas DataFrame
. We will upload the data to the server in a later step.
In [5]:
phones_df = pd.read_csv('https://raw.githubusercontent.com/singlestore-labs/ibis-singlestoredb/main/examples/phones.csv')phones_df.head(3)
We can see the column types using the info
method. As you can see, there are two date columns, neither of which is in a convenient
format for use as a date. We'll first create an actual datetime column from those and drop the original columns.
In [6]:
phones_df.info()
Upload the data to the server
Now that we have some data to work with, we can upload it to the server using the create_table
method. This method allows you to upload
data from a local DataFrame
or materialize a table from an Ibis table expression.
In [7]:
conn.create_table?
The code below will create the table in the server from the given DataFrame
. The data types for the database table will be inferred from
the DataFrame
data types. The overwrite=True
is merely used here in case you run the notebook code multiple times against the same
database. It will cause an existing phones
table to be overwritten if it already exists.
In [8]:
phones_tbl = conn.create_table('phones', phones_df, overwrite=True)phones_tbl.head(3)
We can get information about the table in the server by using the info
method, just as you would with a local pandas DataFrame
.
In [9]:
phones_tbl.info()
Here is the CREATE TABLE
statement resulting from the data upload step. The show
attribute of the connection gives access to all of
the SHOW
commands in the database such as functions
, create_function
, status
, databases
, and so on.
In [10]:
ct = conn.show.create_table('phones')ct
To print this a nicer way, we'll just display the CreateTable
column of the first result row.
In [11]:
print(ct[0]['CreateTable'])
Convert review date / time columns
This data set has a less than optimal way of storing the dates of the reviews. It has two columns that contain the same information and the review time, and neither
of them is convenient to work with. Let's create a new column that contains the date in a timestamp column. To do that, we simply cast the
existing unixReviewTime
column to a timestamp type. We'll also use the name
method to assign a name to the resulting column expression.
In [12]:
date = phones_tbl.unixReviewTime.cast('timestamp').name('date')date
The overall
rating column was uploaded as floating point data, but it only contains integers. Let's cast that to an integer.
In [13]:
overall = phones_tbl.overall.cast('int').name('overall')overall
Now that we have our two processed columns, we can remove the old columns we don't need anymore and insert our new columns expressions with the cleaned data.
In [14]:
columns = [x for x in phones_tbl.columns if 'Time' not in x and x != 'overall'] + [overall, date]
In [15]:
phones_tbl = phones_tbl[columns]phones_tbl
The phones_tbl
object now contains both references to actual columns in the phones
table as well as computed expressions.
It can be thought of as a client-side view object, but still works the same way as our original table. We can use the info
method to show the schema information.
In [16]:
phones_tbl.info()
It's also possible to access the schema information for traversing progamatically using the schema
method of the table.
In [17]:
phones_tbl.schema()
Here are the public methods / attributes of the Schema
object.
In [18]:
for name in dir(phones_tbl.schema()):if not name.startswith('_'):print(name)
Create a table from a table expression
As mentioned earlier, the create_table
method can be used to create tables in the database from Ibis table
expressions. That means that we can materialize our new table into the database if we so desire. This simply
selects data from an expression into a new table, no work is being done on the client side.
In [19]:
tbl_2 = conn.create_table('phones2', phones_tbl, overwrite=True)tbl_2
In [20]:
conn.show.tables()
In [21]:
tbl_2.info()
In [22]:
conn.show.create_table('phones2')
Tables can also be dropped using the drop_table
method.
In [23]:
conn.drop_table('phones2')
DataFrame operations
Now that we have a table in the database and a DataFrame
-like obect connected to it, let's do some operations with the data. For this example,
we'll enable the verbose
option in Ibis. This will display the SQL code that is sent to the server when evaluating the expression. In this
example, we are building a sentence from various columns in the data table. We are using the reviewer name (which is capitalized), the summary,
and the overall rating (which is cast to a string). All of these are concatenated together using the +
as you normally would when
concatenating strings in Python.
In [24]:
# Enable display of SQL that is executed in the backgroundibis.options.verbose = Truetexpr = (phones_tbl.reviewerName.capitalize() + \' says "' + \phones_tbl.summary + \'" and gives it ' + \phones_tbl.overall.cast('string') + \' stars.').name('Review Summary')texpr
As you can see from the SQL output above, the sentence is constructed using concat
in the query. Capitalization is done using ucase
and substring
.
The :param_#
portions are used by Ibis to inject literal Python values from the expression. And, you'll see that the result table contains a column
of string values that are the result of the expression.
String operations
There are many string operations available in Ibis. Many of them correspond to operations on pandas DataFrames and / or Python string methods. The code below prints a list of all methods on string columns.
In [25]:
for x in dir(texpr):if not x.startswith('_'):print(x)
Operations on other column types
Other data types have different methods that can be called. For example, this is the list of methods on date/time columns.
In [26]:
for x in dir(phones_tbl.date):if not x.startswith('_'):print(x)
Here are the methods for numeric columns.
In [27]:
for x in dir(phones_tbl.overall):if not x.startswith('_'):print(x)
In [28]:
ibis.options.verbose = False
Filter reviews containing "iphone 4"
Now that we've seen how to construct expressions that can be used in the database, let's use that expression for use in further queries. The expression below takes the result of our previous expression, lower-cases it, then checks for the occurrence of the substring "iphone 4". This will return a column of booleans which can be used to filter our original table, just like with a pandas DataFrame.
In [29]:
has_iphone_4 = texpr.lower().contains('iphone 4').name('Has iPhone 4')has_iphone_4
First, let's use the filter expression to see how many of our generated sentences contain "iphone 4". We do this by using our previous
expression as an indexing expression to our original table object (e.g., tbl[filter-expr]
). This will return only the rows where
the filter expression was True
. We can then count the number of rows using the count
method.
In [30]:
phones_tbl[has_iphone_4].count()
Now we can print selected columns from the filtered table. We can also sort them according to the overall
column as shown below.
In [31]:
phones_tbl[has_iphone_4]['reviewerName', 'helpful', 'overall'].order_by(ibis.desc('overall'))
Conclusion
So far we've uploaded data to the database using create_table
and a local DataFrame
, we've constructed Python expressions that get translated
to SQL and executed on the server, used those expressions in filtering operations, and selected columns from and sorted the results of
the filtering operation. This covers a small number of the abilities of the table and column objects created by Ibis. For additional methods
on the various types, see the Ibis documentation.
Action Required
If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.
In [32]:
shared_tier_check = %sql show variables like 'is_shared_tier'if not shared_tier_check or shared_tier_check[0][1] == 'OFF':%sql DROP DATABASE IF EXISTS ibis_getting_started;
Details
About this Template
Data in SingleStoreDB can be queried and modified using a familiar DataFrame syntax in addition to SQL.
This Notebook can be run in Shared Tier, Standard and Enterprise deployments.
Tags
License
This Notebook has been released under the Apache 2.0 open source license.