New

Integrating pandas with SingleStoreDB

Notebook

SingleStore Notebooks

Integrating pandas with SingleStoreDB

Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.

This notebook will show how to move data from a pandas DataFrame into SingleStoreDB as well as downloading a SingleStoreDB query to a pandas DataFrame. It should be noted that this is only intended for relatively small data sets and to do processing that can't otherwise be done in SingleStoreDB itself. Moving data to the client for processing should only be done when there is no other alternative in the database.

In [1]:

import ibis
import pandas as pd
import singlestoredb as s2
import sqlalchemy as sa

Action Required

If you have a Free Starter Workspace deployed already, select the database from drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

Create a database

We need to create a database to work with in the following examples.

In [2]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
%sql DROP DATABASE IF EXISTS pandas_integration;
%sql CREATE DATABASE pandas_integration;

Action Required

Make sure to select the pandas_integration database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

Database connections

In the notebooks environment, the connection string for the currently selected database is kept in the connection_url variable as well as the SINGLESTOREDB_URL environment variable. The connection variables are accessed automatically within the SingleStoreDB Python packages so that you do not need connection parameters when connecting.

In the following sections, we will connect to SingleStoreDB using each of the packages and demonstrate techniques for moving data between pandas and SingleStoreDB.

The Iris data set

We'll be using the Iris data set for the following examples. This data set includes five columns: sepal_length, sepal_width, petal_length, petal_width and class.

In [3]:

iris = pd.read_csv('https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/notebooks/integrating-with-pandas/data/iris.csv')
iris

As you can see below, the first four columns are floats and the last column is a string (represented as an object in DataFrames).

In [4]:

iris.info()

Moving data between SingleStoreDB and pandas DataFrames

Moving data from pandas DataFrames to SingleStoreDB tables can be done in various ways from Python and even from each of the packages described here. This reference is to show the best techniques when using each package.

It should be noted that moving data back-and-forth between pandas and SingleStoreDB should only be done when absolutely needed since this can be a major bottleneck when working with and analyzing data. The hope is that the features of SingleStoreDB are sufficient enough to alleviate the need to do much processing (if any) on the client machine.

SingleStoreDB Python

The core library is the SingleStoreDB Python package. This is the package that all other SingleStoreDB packages are built on. To connect, simply call the connect function.

In [5]:

s2_conn = s2.connect()

Since the core library is a fairly low-level interface to SingleStoreDB, most operations are done simply by sending SQL code.

Creating a table

Because we are using a low-level driver, creating a table is just done using SQL code. We'll use the information about the data set above to construct a CREATE TABLE statement.

In [6]:

s2_cur = s2_conn.cursor()
s2_cur.execute(r'''
CREATE TABLE IF NOT EXISTS iris (
sepal_length DOUBLE,
sepal_width DOUBLE,
petal_length DOUBLE,
petal_width DOUBLE,
class TEXT
)
''')

Upload the data from a DataFrame

Now that we have a table, we can populate it with data from the DataFrame. Again, we will use SQL statements to do this. The Python client can execute single SQL statements using the execute method as used above, but since we are uploading multiple rows of data it is better to use the executemany method since it is optimized for this purpose.

In [7]:

# Construct the list of column names
cols = ', '.join(f'`{x}`' for x in iris.columns)
# Construct a list of value placeholders for the INSERT statement
values = ', '.join(['%s'] * len(iris.columns))
# Get data as a list of tuples (not including the index)
data = list(iris.itertuples(index=False))
# Execute the INSERT statement
s2_cur.executemany(f'INSERT INTO iris({cols}) VALUES ({values})', data)

We can select a sample of the rows to see that the data is now in SingleStoreDB.

In [8]:

s2_cur.execute('SELECT * FROM iris LIMIT 10')
for row in s2_cur:
print(row)

Downloading the data to a DataFrame

We can download the data to a pandas DataFrame simply by selecting all columns of data, fetching all of the rows, and passing them to the DataFrame constructor.

In [9]:

s2_cur.execute('SELECT * FROM iris')
# Use the `description` attribute to get the column names
names = [x[0] for x in s2_cur.description]
s2_iris_df = pd.DataFrame(list(s2_cur), columns=names)
s2_iris_df

In [10]:

s2_iris_df.info()

Now that we have demonstrated uploading and downloading data from a pandas DataFrame using the SingleStoreDB Python client, we can drop the table and move on to SQLAlchemy.

In [11]:

s2_cur.execute('DROP TABLE IF EXISTS iris')

SQLAlchemy

In addition to the core Python library, you can use SQLAlchemy to connect to SingleStoreDB. Typically, when using SQLAlchemy, you would use the SQLAlchemy create_engine function to create an engine, then call connect on the engine to create connections from a pool. The SingleStoreDB Python package also has a create_engine function that does the same thing, however, it extends the default ability by allow you to use the SINGLESTOREDB_URL environment variable as the connection string so that no parameters are needed for create_engine when used in the notebooks environment.

In [12]:

sa_eng = s2.create_engine()
sa_conn = sa_eng.connect()

Uploading the data from a DataFrame

Uploading data from a DataFrame using SQLAlchemy is much easier than the lower-level Python library. The pandas library itself has the ability to communicate with SingleStoreDB using a SQLAlchemy connection. In this case, the DataFrame can create the table and populate it in one step using the to_sql method. The to_sql method has various options to modify its behavior documented on the pandas web site.

In [13]:

iris.to_sql('iris', con=sa_conn, index=False, if_exists='replace')

We can verify the data is in SingleStoreDB with a simple SELECT statement.

In [14]:

for row in sa_conn.execute(sa.text('SELECT * FROM iris LIMIT 10')):
print(row)

It is also possible to use SQLAlchemy expressions to query the table rather than raw SQL strings.

In [15]:

# Create a metadata object for the database
if sa.__version__.startswith('1'):
db = sa.MetaData(bind=sa_eng)
sa.MetaData.reflect(db)
else:
db = sa.MetaData()
db.reflect(bind=sa_eng)
# Get the iris table from reflected data
sa_iris = db.tables['iris']
# Query the iris table
query = sa.select(sa_iris).limit(10)
# Print results
for row in sa_conn.execute(query):
print(row)

Downloading the data to a DataFrame

Downloading data to a pandas DataFrame is very simple. The result of the execute method can be passed directly to the pandas DataFrame constructor.

In [16]:

# Reset query to not include the limit
query = sa.select(sa_iris)
sa_iris_df = pd.DataFrame(sa_conn.execute(query))
sa_iris_df

In [17]:

sa_iris_df.info()

It is also possible to use pd.read_sql to bookend the use of df.to_sql.

In [18]:

sa_iris_df = pd.read_sql(query, con=sa_conn)
sa_iris_df

Now that we have demonstrated using SQLAlchemy to upload and download pandas DataFrames we can drop the table and move on to Ibis.

In [19]:

sa_iris.drop(bind=sa_eng)

Ibis (SingleStoreDB DataFrame)

The Ibis package allows you to treat tables in SingleStoreDB as DataFrames. The DataFrame expressions are used to build lazy expressions which generate SQL statements that get submitted to SingleStoreDB only when you want to see the results of a query. Ibis using SQLAlchemy connections behind-the-scenes.

In [20]:

ibis_conn = ibis.singlestoredb.connect()

Uploading the data from a DataFrame

Ibis is intended for tight integration with pandas, so it is no surprise that uploading a pandas DataFrame with Ibis is straight-forward.

If you are not familiar with Ibis, you may notice the execute call at the end of this cell. Ibis creates expressions in memory on the client machine until a view of the data is explicitly asked for. Once you explicitly ask for a query to be executed, it then generates and submits the SQL code for the expression behind-the-scenes.

In this case, the ibis_iris object is a DataFrame-like object that is lazily constructing the requested expression until execute is called on it. In the case of this example, uploading and downloading

In [21]:

ibis_iris = ibis_conn.create_table('iris', iris, overwrite=True)
ibis_iris.limit(10).execute()

It is also possible to insert the data from a DataFrame into an existing table using the insert method of the connection.

In [22]:

ibis_conn.insert('iris', iris)

You'll see that we now have 300 rows since we've inserted the data twice.

In [23]:

ibis_iris.count().execute()

One way to see the SQL that gets submitted during execute is to compile the expression and print it. Ibis also has a options to display SQL queries as they are submitted.

In [24]:

print(ibis_iris.compile())

The information about the table can be retrieved much like in a local pandas DataFrame.

In [25]:

ibis_iris.info().execute()

Downloading the data from a DataFrame

The output from evaluating Ibis expressions returns a DataFrame, so we have already demonstrated downloading data, but here is the code again.

In [26]:

ibis_iris_df = ibis_iris.execute()
ibis_iris_df

Ibis Tables also have a to_pandas method.

In [27]:

ibis_iris.to_pandas()

If you do not have an Ibis object reference to a table already, you can get one using the table method or tables attribute of the Ibis connection.

In [28]:

# Use this version if the table name is not a valid Python variable name
ibis_iris = ibis_conn.table('iris')
# This form can be used if the table name is a valid Python variable name
ibis_iris = ibis_conn.tables.iris

In [29]:

ibis_iris.limit(10).execute()

We have demonstrated both uploading and downloading pandas DataFrames using Ibis, so we can drop the table now.

In [30]:

ibis_conn.drop_table('iris')

%%sql and %sql magic commands

The IPython interpreter can be extended with "magic" commands. The SingleStore Cloud notebook environment uses the JupySQL plugin for the %sql, %%sql, and %sqlplot commands. These work in conjunction with SQLAlchemy to allow you to type SQL code in the notebook cells. They also have ways of integrating with pandas. The notebook environment automatically sets up the connection string for use in these commands.

Creating a table

Creating a table with the %%sql command is done simply by submitting the CREATE TABLE statement.

In [31]:

%%sql
DROP TABLE IF EXISTS iris;
CREATE TABLE IF NOT EXISTS iris (
sepal_length DOUBLE,
sepal_width DOUBLE,
petal_length DOUBLE,
petal_width DOUBLE,
class TEXT
);

Uploading the data from a DataFrame

The %sql command has options that allow you to upload data from a DataFrame. The --persist option will create a table in the database and upload the data. The --append option will append data to an existing table. In this case, the name used for the DataFrame variable is used for the table name in SingleStoreDB.

In [32]:

%sql --append --no-index iris

In [33]:

%%sql
SELECT * FROM iris LIMIT 10;

Downloading the data from a DataFrame

There are a few ways of getting data from SingleStoreDB into a DataFrame using the SQL magic commands. The first is to use the %sql command and convert the result manually.

In [34]:

out = %sql SELECT * FROM iris
sql_iris_df = out.DataFrame()
sql_iris_df

You can also pass the result of the query to the DataFrame constructor.

In [35]:

sql_iris_df = pd.DataFrame(out)
sql_iris_df

Finally, the output of the %%sql command can be stored to a variable which can then be converted to a DataFrame in the same manner as above.

In [36]:

%%sql result <<
SELECT * FROM iris;

In [37]:

sql_iris_df = pd.DataFrame(result)
sql_iris_df

Automatically return pandas DataFrames

The other option for getting DataFrames as the result of the SQL magic commands is to enable the SqlMagic.autopandas option. This will cause all results from SQL magic commands to be converted to DataFrames without any intervention.

In [38]:

%config SqlMagic.autopandas = True

In [39]:

out = %sql SELECT * FROM iris
out

In [40]:

type(out)

Now that we have demonstrated uploading and downloading of DataFrames using the SQL magic commands, we can reset the configuration options and drop the table.

In [41]:

%config SqlMagic.autopandas = False
%sql DROP TABLE IF EXISTS iris

Conclusion

Action Required

If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.

We have shown how to upload and download data from a pandas DataFrame to and from SingleStoreDB using the SingleStoreDB Python client, SQLAlchemy, and Ibis. These techniques should enable you to integrate your pandas workflows with SingleStoreDB.

In [42]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
%sql DROP DATABASE IF EXISTS pandas_integration;

Details

Tags

#starter#dataframe#python

License

This Notebook has been released under the Apache 2.0 open source license.