Top 5 Questions Answered at Spark Summit


Conor Doherty

Technical Marketing Engineer at SingleStore

The SingleStore team enjoyed sponsoring and attending Spark Summit last week, where we spoke with hundreds of developers, data scientists, and architects all getting a better handle on modern data processing technologies like Spark and SingleStore. After a couple of days on the expo floor, I noticed several common questions. Below are some of the most frequent questions and answers exchanged in the SingleStore booth.

1. When should I use SingleStore?

SingleStore shines in use cases requiring analytics on a changing data set. The legacy data processing model, which creates separate silos for transactions and analytics, prevents updated data from propagating to reports and dashboards until the nightly or weekly ETL job runs. Serving analytics from a real-time operational database means reports and dashboards are accurate up to the last event, not last week.

That said, SingleStore is a relational database and you can use it to build whatever application you want! In practice, many customers choose SingleStore because it is the only solution able to handle concurrent ingest and query execution for analyzing changing datasets in real time.

2. What does SingleStore have to do with Spark?

Short answer: you need to persist Spark data somewhere, whether in SingleStore or in another data store. Choosing SingleStore provides several benefits including:

  • In-memory storage and data serving for maximum performance
  • Structured database schema and indexes for fast lookups and query execution
  • A connector that parallelizes data transfer and processing for high throughput

Longer answer: There are two main use cases for Spark and SingleStore:

  1. Load data through Spark into SingleStore, transforming and enriching data on the fly in Spark. In this scenario, data is structured and ready to be queried as soon as it lands in SingleStore, enabling applications like dashboards and interactive analytics on real-time data. We demonstrated this “real-time pipeline” at Spark Summit, processing and analyzing real-time energy consumption data from tens of millions of devices and appliances.
  2. Leverage the Spark DataFrame API for analytics beyond SQL using data from SingleStore. One of the best features of Spark is the expressive but concise programming interface. In addition to enabling SingleStore users to express iterative computations, it gives them access to the many libraries that run on the Spark execution engine. The SingleStore Spark connector is optimized to push computation into SingleStore to minimize data transfer and to take advantage of the SingleStore optimizer and indexing.
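To make the two use cases concrete, here is a minimal Python sketch. It is a toy model, not the connector: the standard-library sqlite3 module stands in for the database, plain Python functions stand in for Spark transformations, and the event format, the `device_types` lookup, and the `readings` table are all hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical raw events as they might arrive from a stream: "device_id,watts".
raw_events = ["1,120.0", "2,75.5", "3,310.0", "2,80.0"]

# Hypothetical lookup table used to enrich each event.
device_types = {1: "thermostat", 2: "fridge", 3: "oven"}

def transform(line):
    """Parse one raw event and enrich it with the device type."""
    device_id, watts = line.split(",")
    device_id = int(device_id)
    return (device_id, device_types[device_id], float(watts))

# sqlite3 is only a stand-in for the database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (device_id INTEGER, device_type TEXT, watts REAL)"
)
conn.execute("CREATE INDEX idx_type ON readings (device_type)")

# Use case 1: transform on the fly, then load rows that are already structured.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)", map(transform, raw_events)
)

# Use case 2: push the filter and aggregation into the database, so only the
# small result is transferred rather than every row.
pushed_down = conn.execute(
    "SELECT device_type, SUM(watts) FROM readings "
    "WHERE device_type = ? GROUP BY device_type",
    ("fridge",),
).fetchall()

# The same answer computed client-side, which has to fetch all rows first.
all_rows = conn.execute(
    "SELECT device_id, device_type, watts FROM readings"
).fetchall()
client_side = sum(watts for _, dtype, watts in all_rows if dtype == "fridge")
```

Both approaches give the same total, but the pushed-down query returns a single row while the client-side version moves the whole table. That difference is the reason a connector would prefer to let the database do the filtering and aggregation.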

3. What’s the difference between SingleStore and Spark SQL?

There are several differences:

  • Spark is a data processing framework, not a database, and does not natively support persistent storage. SingleStore is a database that stores data in memory and writes logs and full database snapshots to disk for durability.
  • Spark treats datasets (RDDs) as immutable – there is currently no concept of an INSERT, UPDATE, or DELETE. You could express these concepts as a transformation, but this operation returns a new RDD rather than updating the dataset in place. In contrast, SingleStore is an operational database with full transactional semantics.
  • SingleStore supports updatable relational database indexes. The closest analogue in Spark is IndexedRDD, which is currently under development, and provides updatable key/value indexes within a single thread.
  • Spark SQL provides a SQL server, and the Spark DataFrame library also serves as a general-purpose library for manipulating structured data.
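The immutability point is easier to see in code. The sketch below uses plain Python rather than Spark: a tuple of records stands in for an immutable dataset, and a dictionary stands in for an updatable table. The `Reading` type and the `update_watts` helper are hypothetical names used only for this illustration.

```python
from typing import NamedTuple

class Reading(NamedTuple):
    device_id: int
    watts: float

# An immutable dataset (a tuple cannot be modified in place), standing in
# for an RDD.
readings = (Reading(1, 120.0), Reading(2, 75.5), Reading(3, 310.0))

def update_watts(dataset, device_id, new_watts):
    """Express an UPDATE as a transformation that returns a new dataset."""
    return tuple(
        r._replace(watts=new_watts) if r.device_id == device_id else r
        for r in dataset
    )

# The "update" produces a second dataset; the original is left untouched.
updated = update_watts(readings, 2, 80.0)

# By contrast, an updatable store changes the existing data in place.
table = {r.device_id: r.watts for r in readings}
table[2] = 80.0
```

After this runs, `readings` still holds the old value for device 2, `updated` holds the new one, and `table` has been changed in place. An operational database behaves like `table`, with transactional guarantees on top.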

4. How do SingleStore and Spark interact with one another?

The SingleStore Spark Connector is an open source tool available on the SingleStore GitHub page. Under the hood, the connector creates a mapping between SingleStore database partitions and Spark RDD partitions. It also takes advantage of both systems’ distributed architectures to load data in parallel. The connector comes with a small library that includes the SingleStoreRDD class, allowing the user to create an RDD from the result of a SQL query in SingleStore. SingleStoreRDD also comes with a method called saveToSingleStore(), which makes it easy to write data to SingleStore after processing.
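The partition mapping and parallel load described above can be pictured with a small Python sketch. This is a toy model under stated assumptions, not the connector's API: the `db_partitions` dictionary, `read_partition`, and `load_in_parallel` are hypothetical, and a thread pool stands in for Spark's distributed workers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a database whose rows are split across partitions.
db_partitions = {
    0: [("device-a", 120.0), ("device-d", 64.0)],
    1: [("device-b", 75.5)],
    2: [("device-c", 310.0), ("device-e", 12.5)],
}

def read_partition(partition_id):
    """Read the rows of one database partition (simulated)."""
    return list(db_partitions[partition_id])

def load_in_parallel(partition_ids):
    """Build one dataset partition per database partition, reading concurrently."""
    with ThreadPoolExecutor(max_workers=len(partition_ids)) as pool:
        # map() preserves input order, so result i belongs to partition_ids[i].
        return list(pool.map(read_partition, partition_ids))

# One-to-one mapping: dataset partition i holds the rows of database partition i.
rdd_partitions = load_in_parallel(sorted(db_partitions))
```

Because each reader touches only its own partition, no reader waits on another, and the number of concurrent readers grows with the number of partitions.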

5. Can I have one of those cool t-shirts? (Of course!) What does the design mean?

The design is a graphical representation of Hybrid Transactional/Analytical Processing (HTAP), a term coined by Gartner. It refers to the convergence of transactional and analytical processing in a single database, usually for real-time analytics.

Circling back to the first question, SingleStore excels at this kind of hybrid workload. In addition to reducing latency and consolidating hardware, HTAP powers tight operational feedback loops that can create opportunities for net-new revenue and bottom-line cost savings. For more information on HTAP, read the Gartner Market Guide for In-Memory Databases.