New

Getting Started with DataFrames in SingleStoreDB

Notebook

SingleStore Notebooks

Getting Started with DataFrames in SingleStoreDB

Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.

SingleStoreDB supports a backend for the Ibis to create a pandas DataFrame-like experience while use the power of a distributed database on the backend. When interacting with Ibis objects, expressions are built of the operations rather than executing them instantly. Once you reach the point where you want to see the results of the operations, you call the .execute() method which constructs a SQL query and submits it to the SingleStoreDB server. The results then come back as a local pandas DataFrame or Series. The entire workflow gives the feel of working with pandas DataFrames while giving you the ability to work with much larger data sets.

In the SingleStore Cloud environment, all of the needed Python packages are already installed. To run this notebook outside of that environment, you will need to install:

pip install 'singlestoredb[dataframe]'

The above will install the SingleStoreDB Python client, the SingleStoreDB SQLAlchemy dialect, and the SingleStoreDB Ibis backend.

In [1]:

import ibis
import pandas as pd

We'll set up a few options to make the notebook nicer to use, including the display width of DataFrames and enabling the interactive features of the Ibis package. The primary purpose of the interactive option is so that you don't have to manually run the execute method on expressions to see the results. In our notebook, the execute method will be called implicitly each time an Ibis expression is rendered by the notebook.

In [2]:

pd.options.display.max_colwidth = 120
ibis.options.verbose = False
ibis.options.interactive = True

Action Required

If you have a Free Starter Workspace deployed already, select the database from drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

Create a database

If you have a Standard or Premium workspace deployment, you can create a new database to run this notebook. Running the cell below to create the new database in your existing workspace. Note: this will NOT create a new database in your Free Starter Workspace.

In [3]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
%sql DROP DATABASE IF EXISTS ibis_getting_started;
%sql CREATE DATABASE ibis_getting_started;

Action Required

Make sure to select a database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

Make a connection using Ibis

Connections to Ibis are made through the ibis.singlestoredb method. The connection parameters can be given using either the suggested Python DB-API parameters (i.e., host=, user=, port=, etc.) or as a SQLAlchemy URL (e.g., user:password@host:port/database). In this case, we are getting the username and password from the environment variable SINGLESTOREDB_URL which is supplied by the SingleStore notebook environment.

In [4]:

conn = ibis.singlestoredb.connect()

Phone review data example

Our first example will use customer review data for phones. It includes information such as a reviewer name, the text of the review, a short summary of the review, a numeric rating from 1 to 5, and the date of the review. We'll first read the data into a local pandas DataFrame to do some cleanup, then upload it to the database to demonstrate the SingleStoreDB package capabilities

Read sample data into pandas DataFrame

We will read the data for this example directly from a Github repository using a URL in the pd.read_csv function. The result will be a pandas DataFrame. We will upload the data to the server in a later step.

In [5]:

phones_df = pd.read_csv('https://raw.githubusercontent.com/singlestore-labs/ibis-singlestoredb/main/examples/phones.csv')
phones_df.head(3)

We can see the column types using the info method. As you can see, there are two date columns, neither of which is in a convenient format for use as a date. We'll first create an actual datetime column from those and drop the original columns.

In [6]:

phones_df.info()

Upload the data to the server

Now that we have some data to work with, we can upload it to the server using the create_table method. This method allows you to upload data from a local DataFrame or materialize a table from an Ibis table expression.

In [7]:

conn.create_table?

The code below will create the table in the server from the given DataFrame. The data types for the database table will be inferred from the DataFrame data types. The overwrite=True is merely used here in case you run the notebook code multiple times against the same database. It will cause an existing phones table to be overwritten if it already exists.

In [8]:

phones_tbl = conn.create_table('phones', phones_df, overwrite=True)
phones_tbl.head(3)

We can get information about the table in the server by using the info method, just as you would with a local pandas DataFrame.

In [9]:

phones_tbl.info()

Here is the CREATE TABLE statement resulting from the data upload step. The show attribute of the connection gives access to all of the SHOW commands in the database such as functions, create_function, status, databases, and so on.

In [10]:

ct = conn.show.create_table('phones')
ct

To print this a nicer way, we'll just display the CreateTable column of the first result row.

In [11]:

print(ct[0]['CreateTable'])

Convert review date / time columns

This data set has a less than optimal way of storing the dates of the reviews. It has two columns that contain the same information and the review time, and neither of them is convenient to work with. Let's create a new column that contains the date in a timestamp column. To do that, we simply cast the existing unixReviewTime column to a timestamp type. We'll also use the name method to assign a name to the resulting column expression.

In [12]:

date = phones_tbl.unixReviewTime.cast('timestamp').name('date')
date

The overall rating column was uploaded as floating point data, but it only contains integers. Let's cast that to an integer.

In [13]:

overall = phones_tbl.overall.cast('int').name('overall')
overall

Now that we have our two processed columns, we can remove the old columns we don't need anymore and insert our new columns expressions with the cleaned data.

In [14]:

columns = [x for x in phones_tbl.columns if 'Time' not in x and x != 'overall'] + [overall, date]

In [15]:

phones_tbl = phones_tbl[columns]
phones_tbl

The phones_tbl object now contains both references to actual columns in the phones table as well as computed expressions. It can be thought of as a client-side view object, but still works the same way as our original table. We can use the info method to show the schema information.

In [16]:

phones_tbl.info()

It's also possible to access the schema information for traversing progamatically using the schema method of the table.

In [17]:

phones_tbl.schema()

Here are the public methods / attributes of the Schema object.

In [18]:

for name in dir(phones_tbl.schema()):
if not name.startswith('_'):
print(name)

Create a table from a table expression

As mentioned earlier, the create_table method can be used to create tables in the database from Ibis table expressions. That means that we can materialize our new table into the database if we so desire. This simply selects data from an expression into a new table, no work is being done on the client side.

In [19]:

tbl_2 = conn.create_table('phones2', phones_tbl, overwrite=True)
tbl_2

In [20]:

conn.show.tables()

In [21]:

tbl_2.info()

In [22]:

conn.show.create_table('phones2')

Tables can also be dropped using the drop_table method.

In [23]:

conn.drop_table('phones2')

DataFrame operations

Now that we have a table in the database and a DataFrame-like obect connected to it, let's do some operations with the data. For this example, we'll enable the verbose option in Ibis. This will display the SQL code that is sent to the server when evaluating the expression. In this example, we are building a sentence from various columns in the data table. We are using the reviewer name (which is capitalized), the summary, and the overall rating (which is cast to a string). All of these are concatenated together using the + as you normally would when concatenating strings in Python.

In [24]:

# Enable display of SQL that is executed in the background
ibis.options.verbose = True
texpr = (phones_tbl.reviewerName.capitalize() + \
' says "' + \
phones_tbl.summary + \
'" and gives it ' + \
phones_tbl.overall.cast('string') + \
' stars.').name('Review Summary')
texpr

As you can see from the SQL output above, the sentence is constructed using concat in the query. Capitalization is done using ucase and substring. The :param_# portions are used by Ibis to inject literal Python values from the expression. And, you'll see that the result table contains a column of string values that are the result of the expression.

String operations

There are many string operations available in Ibis. Many of them correspond to operations on pandas DataFrames and / or Python string methods. The code below prints a list of all methods on string columns.

In [25]:

for x in dir(texpr):
if not x.startswith('_'):
print(x)

Operations on other column types

Other data types have different methods that can be called. For example, this is the list of methods on date/time columns.

In [26]:

for x in dir(phones_tbl.date):
if not x.startswith('_'):
print(x)

Here are the methods for numeric columns.

In [27]:

for x in dir(phones_tbl.overall):
if not x.startswith('_'):
print(x)

In [28]:

ibis.options.verbose = False

Filter reviews containing "iphone 4"

Now that we've seen how to construct expressions that can be used in the database, let's use that expression for use in further queries. The expression below takes the result of our previous expression, lower-cases it, then checks for the occurrence of the substring "iphone 4". This will return a column of booleans which can be used to filter our original table, just like with a pandas DataFrame.

In [29]:

has_iphone_4 = texpr.lower().contains('iphone 4').name('Has iPhone 4')
has_iphone_4

First, let's use the filter expression to see how many of our generated sentences contain "iphone 4". We do this by using our previous expression as an indexing expression to our original table object (e.g., tbl[filter-expr]). This will return only the rows where the filter expression was True. We can then count the number of rows using the count method.

In [30]:

phones_tbl[has_iphone_4].count()

Now we can print selected columns from the filtered table. We can also sort them according to the overall column as shown below.

In [31]:

phones_tbl[has_iphone_4]['reviewerName', 'helpful', 'overall'].order_by(ibis.desc('overall'))

Conclusion

So far we've uploaded data to the database using create_table and a local DataFrame, we've constructed Python expressions that get translated to SQL and executed on the server, used those expressions in filtering operations, and selected columns from and sorted the results of the filtering operation. This covers a small number of the abilities of the table and column objects created by Ibis. For additional methods on the various types, see the Ibis documentation.

Action Required

If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.

In [32]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
%sql DROP DATABASE IF EXISTS ibis_getting_started;

Details

Tags

#starter#dataframe#python

License

This Notebook has been released under the Apache 2.0 open source license.