Integrating pandas with SingleStoreDB
Notebook
Note
This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.
This notebook shows how to move data from a pandas DataFrame into SingleStoreDB, as well as how to download a SingleStoreDB query result into a pandas DataFrame. Note that this is only intended for relatively small data sets and for processing that can't otherwise be done in SingleStoreDB itself. Moving data to the client for processing should only be done when there is no alternative in the database.
In [1]:
import ibis
import pandas as pd
import singlestoredb as s2
import sqlalchemy as sa
Action Required
If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
Create a database
We need to create a database to work with in the following examples.
In [2]:
shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS pandas_integration;
    %sql CREATE DATABASE pandas_integration;
Action Required
Make sure to select the pandas_integration database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
Database connections
In the notebooks environment, the connection string for the currently selected database is kept in the connection_url variable as well as the SINGLESTOREDB_URL environment variable. The connection variables are accessed automatically within the SingleStoreDB Python packages, so you do not need to supply connection parameters when connecting.
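To illustrate, the packages read this environment variable themselves, but you can also inspect it directly. A minimal standard-library sketch; the URL below is a made-up placeholder, not a real connection string:

```python
import os
from urllib.parse import urlparse

# Hypothetical placeholder URL -- in the notebooks environment,
# SINGLESTOREDB_URL is already set for you.
os.environ.setdefault(
    'SINGLESTOREDB_URL',
    'singlestoredb://user:pass@example-host:3306/pandas_integration',
)

# The SingleStoreDB packages read this variable automatically;
# here we just parse it to show what it contains.
url = urlparse(os.environ['SINGLESTOREDB_URL'])
print(url.scheme, url.hostname, url.port, url.path.lstrip('/'))
```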
In the following sections, we will connect to SingleStoreDB using each of the packages and demonstrate techniques for moving data between pandas and SingleStoreDB.
The Iris data set
We'll be using the Iris data set for the following examples. This data set includes five columns: sepal_length, sepal_width, petal_length, petal_width, and class.
In [3]:
iris = pd.read_csv('https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/notebooks/integrating-with-pandas/data/iris.csv')
iris
As you can see below, the first four columns are floats and the last column is a string (represented as an object in DataFrames).
In [4]:
iris.info()
Moving data between SingleStoreDB and pandas DataFrames
Moving data from pandas DataFrames to SingleStoreDB tables can be done in various ways from Python, and even from each of the packages described here. This reference shows the best techniques for each package.
Note that moving data back and forth between pandas and SingleStoreDB should only be done when absolutely needed, since it can be a major bottleneck when working with and analyzing data. The hope is that the features of SingleStoreDB are sufficient to alleviate the need to do much processing (if any) on the client machine.
SingleStoreDB Python
The core library is the SingleStoreDB Python package. This is the package that all other SingleStoreDB packages are built on. To connect, simply call the connect function.
In [5]:
s2_conn = s2.connect()
Since the core library is a fairly low-level interface to SingleStoreDB, most operations are done simply by sending SQL code.
Creating a table
Because we are using a low-level driver, creating a table is done using SQL code. We'll use the information about the data set above to construct a CREATE TABLE statement.
In [6]:
s2_cur = s2_conn.cursor()
s2_cur.execute(r'''
    CREATE TABLE IF NOT EXISTS iris (
        sepal_length DOUBLE,
        sepal_width DOUBLE,
        petal_length DOUBLE,
        petal_width DOUBLE,
        class TEXT
    )
''')
Upload the data from a DataFrame
Now that we have a table, we can populate it with data from the DataFrame. Again, we will use SQL statements to do this. The Python client can execute single SQL statements using the execute method as used above, but since we are uploading multiple rows of data it is better to use the executemany method, which is optimized for this purpose.
In [7]:
# Construct the list of column names
cols = ', '.join(f'`{x}`' for x in iris.columns)

# Construct a list of value placeholders for the INSERT statement
values = ', '.join(['%s'] * len(iris.columns))

# Get data as a list of tuples (not including the index)
data = list(iris.itertuples(index=False))

# Execute the INSERT statement
s2_cur.executemany(f'INSERT INTO iris({cols}) VALUES ({values})', data)
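The execute/executemany pattern above is the standard DB-API interface, so the same approach works with any compliant driver. A minimal, self-contained sketch using the standard library's sqlite3 module as a stand-in (note that SQLite uses ? placeholders where the SingleStoreDB driver uses %s):

```python
import sqlite3

# In-memory SQLite stands in for the SingleStoreDB connection;
# the DB-API calls are the same.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE iris (sepal_length REAL, class TEXT)')

rows = [(5.1, 'Iris-setosa'), (7.0, 'Iris-versicolor'), (6.3, 'Iris-virginica')]

# executemany binds each tuple of parameters in turn -- one statement,
# many rows -- which is what makes it efficient for bulk inserts.
cur.executemany('INSERT INTO iris VALUES (?, ?)', rows)

cur.execute('SELECT COUNT(*) FROM iris')
print(cur.fetchone()[0])  # 3
```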
We can select a sample of the rows to see that the data is now in SingleStoreDB.
In [8]:
s2_cur.execute('SELECT * FROM iris LIMIT 10')

for row in s2_cur:
    print(row)
Downloading the data to a DataFrame
We can download the data to a pandas DataFrame simply by selecting all columns, fetching all of the rows, and passing them to the DataFrame constructor.
In [9]:
s2_cur.execute('SELECT * FROM iris')

# Use the `description` attribute to get the column names
names = [x[0] for x in s2_cur.description]

s2_iris_df = pd.DataFrame(list(s2_cur), columns=names)
s2_iris_df
In [10]:
s2_iris_df.info()
Now that we have demonstrated uploading and downloading data from a pandas DataFrame using the SingleStoreDB Python client, we can drop the table and move on to SQLAlchemy.
In [11]:
s2_cur.execute('DROP TABLE IF EXISTS iris')
SQLAlchemy
In addition to the core Python library, you can use SQLAlchemy to connect to SingleStoreDB. Typically, when using SQLAlchemy, you would use the SQLAlchemy create_engine function to create an engine, then call connect on the engine to create connections from a pool. The SingleStoreDB Python package also has a create_engine function that does the same thing; however, it extends the default behavior by allowing you to use the SINGLESTOREDB_URL environment variable as the connection string, so that no parameters are needed for create_engine when used in the notebooks environment.
In [12]:
sa_eng = s2.create_engine()
sa_conn = sa_eng.connect()
Uploading the data from a DataFrame
Uploading data from a DataFrame using SQLAlchemy is much easier than with the lower-level Python library. The pandas library itself can communicate with SingleStoreDB using a SQLAlchemy connection. In this case, the DataFrame can create the table and populate it in one step using the to_sql method. The to_sql method has various options to modify its behavior, documented on the pandas web site.
In [13]:
iris.to_sql('iris', con=sa_conn, index=False, if_exists='replace')
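To give a feel for those options, here is a minimal sketch of if_exists and chunksize. It uses an in-memory SQLite connection (a stand-in so the sketch runs anywhere; with SingleStoreDB you would pass the SQLAlchemy connection as above), and the table name iris_demo is made up for illustration:

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for SingleStoreDB in this sketch.
conn = sqlite3.connect(':memory:')
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9],
    'class': ['Iris-setosa', 'Iris-setosa'],
})

# if_exists='replace' drops and recreates the table if it already exists.
df.to_sql('iris_demo', con=conn, index=False, if_exists='replace')

# if_exists='append' adds rows to the existing table;
# chunksize writes the rows in batches (useful for larger DataFrames).
df.to_sql('iris_demo', con=conn, index=False, if_exists='append', chunksize=1)

n = conn.execute('SELECT COUNT(*) FROM iris_demo').fetchone()[0]
print(n)  # 4
```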
We can verify the data is in SingleStoreDB with a simple SELECT statement.
In [14]:
for row in sa_conn.execute(sa.text('SELECT * FROM iris LIMIT 10')):
    print(row)
It is also possible to use SQLAlchemy expressions to query the table rather than raw SQL strings.
In [15]:
# Create a metadata object for the database
if sa.__version__.startswith('1'):
    db = sa.MetaData(bind=sa_eng)
    sa.MetaData.reflect(db)
else:
    db = sa.MetaData()
    db.reflect(bind=sa_eng)

# Get the iris table from reflected data
sa_iris = db.tables['iris']

# Query the iris table
query = sa.select(sa_iris).limit(10)

# Print results
for row in sa_conn.execute(query):
    print(row)
Downloading the data to a DataFrame
Downloading data to a pandas DataFrame is very simple. The result of the execute method can be passed directly to the pandas DataFrame constructor.
In [16]:
# Reset query to not include the limit
query = sa.select(sa_iris)

sa_iris_df = pd.DataFrame(sa_conn.execute(query))
sa_iris_df
In [17]:
sa_iris_df.info()
It is also possible to use pd.read_sql to bookend the use of df.to_sql.
In [18]:
sa_iris_df = pd.read_sql(query, con=sa_conn)
sa_iris_df
Now that we have demonstrated using SQLAlchemy to upload and download pandas DataFrames, we can drop the table and move on to Ibis.
In [19]:
sa_iris.drop(bind=sa_eng)
Ibis (SingleStoreDB DataFrame)
The Ibis package allows you to treat tables in SingleStoreDB as DataFrames. The DataFrame expressions are used to build lazy expressions that generate SQL statements, which get submitted to SingleStoreDB only when you want to see the results of a query. Ibis uses SQLAlchemy connections behind the scenes.
In [20]:
ibis_conn = ibis.singlestoredb.connect()
Uploading the data from a DataFrame
Ibis is intended for tight integration with pandas, so it is no surprise that uploading a pandas DataFrame with Ibis is straightforward.
If you are not familiar with Ibis, you may notice the execute call at the end of this cell. Ibis creates expressions in memory on the client machine until a view of the data is explicitly asked for. Once you explicitly ask for a query to be executed, it generates and submits the SQL code for the expression behind the scenes.
In this case, the ibis_iris object is a DataFrame-like object that lazily constructs the requested expression until execute is called on it. In this example, both the upload of the data and the download of the first ten rows happen in the same cell.
In [21]:
ibis_iris = ibis_conn.create_table('iris', iris, overwrite=True)
ibis_iris.limit(10).execute()
It is also possible to insert the data from a DataFrame into an existing table using the insert method of the connection.
In [22]:
ibis_conn.insert('iris', iris)
You'll see that we now have 300 rows since we've inserted the data twice.
In [23]:
ibis_iris.count().execute()
One way to see the SQL that gets submitted during execute is to compile the expression and print it. Ibis also has options to display SQL queries as they are submitted.
In [24]:
print(ibis_iris.compile())
The information about the table can be retrieved much like in a local pandas DataFrame.
In [25]:
ibis_iris.info().execute()
Downloading the data to a DataFrame
The output from evaluating Ibis expressions is a DataFrame, so we have already demonstrated downloading data, but here is the code again.
In [26]:
ibis_iris_df = ibis_iris.execute()
ibis_iris_df
Ibis Tables also have a to_pandas method.
In [27]:
ibis_iris.to_pandas()
If you do not have an Ibis object reference to a table already, you can get one using the table method or the tables attribute of the Ibis connection.
In [28]:
# Use this version if the table name is not a valid Python variable name
ibis_iris = ibis_conn.table('iris')

# This form can be used if the table name is a valid Python variable name
ibis_iris = ibis_conn.tables.iris
In [29]:
ibis_iris.limit(10).execute()
We have demonstrated both uploading and downloading pandas DataFrames using Ibis, so we can drop the table now.
In [30]:
ibis_conn.drop_table('iris')
%%sql and %sql magic commands
The IPython interpreter can be extended with "magic" commands. The SingleStore Cloud notebook environment uses the JupySQL plugin for the %sql, %%sql, and %sqlplot commands. These work in conjunction with SQLAlchemy to allow you to type SQL code in the notebook cells. They also have ways of integrating with pandas. The notebook environment automatically sets up the connection string for use in these commands.
Creating a table
Creating a table with the %%sql command is done simply by submitting the CREATE TABLE statement.
In [31]:
%%sql
DROP TABLE IF EXISTS iris;

CREATE TABLE IF NOT EXISTS iris (
    sepal_length DOUBLE,
    sepal_width DOUBLE,
    petal_length DOUBLE,
    petal_width DOUBLE,
    class TEXT
);
Uploading the data from a DataFrame
The %sql command has options that allow you to upload data from a DataFrame. The --persist option will create a table in the database and upload the data. The --append option will append data to an existing table. In either case, the name of the DataFrame variable is used as the table name in SingleStoreDB.
In [32]:
%sql --append --no-index iris
In [33]:
%%sql
SELECT * FROM iris LIMIT 10;
Downloading the data to a DataFrame
There are a few ways of getting data from SingleStoreDB into a DataFrame using the SQL magic commands. The first is to use the %sql command and convert the result manually.
In [34]:
out = %sql SELECT * FROM iris

sql_iris_df = out.DataFrame()
sql_iris_df
You can also pass the result of the query to the DataFrame constructor.
In [35]:
sql_iris_df = pd.DataFrame(out)
sql_iris_df
Finally, the output of the %%sql command can be stored in a variable, which can then be converted to a DataFrame in the same manner as above.
In [36]:
%%sql result <<
SELECT * FROM iris;
In [37]:
sql_iris_df = pd.DataFrame(result)
sql_iris_df
Automatically return pandas DataFrames
The other option for getting DataFrames as the result of the SQL magic commands is to enable the SqlMagic.autopandas option. This causes all results from SQL magic commands to be converted to DataFrames without any intervention.
In [38]:
%config SqlMagic.autopandas = True
In [39]:
out = %sql SELECT * FROM iris
out
In [40]:
type(out)
Now that we have demonstrated uploading and downloading of DataFrames using the SQL magic commands, we can reset the configuration options and drop the table.
In [41]:
%config SqlMagic.autopandas = False
%sql DROP TABLE IF EXISTS iris
Conclusion
Action Required
If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.
We have shown how to upload and download data between a pandas DataFrame and SingleStoreDB using the SingleStoreDB Python client, SQLAlchemy, Ibis, and the SQL magic commands. These techniques should enable you to integrate your pandas workflows with SingleStoreDB.
In [42]:
shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS pandas_integration;
Details
About this Template
Moving data to and from pandas DataFrames is demonstrated using the SingleStoreDB Python library, SQLAlchemy, and Ibis.
This Notebook can be run in Shared Tier, Standard and Enterprise deployments.
Tags
License
This Notebook has been released under the Apache 2.0 open source license.