Author

Conor Doherty
Technical Marketing Engineer at SingleStore

Data Intensity
Market Making with SingleStore: Simulating Billions of Stock Trades in Real Time
I woke up around 7:30 AM on Monday August 24th, checked my phone while lying in bed, and saw that I had lost thousands of dollars in my sleep. Great way to start the week…
I was not alone – few investors escaped “Black Monday” unscathed. The past several months have been full of sudden surges and declines in stock prices, and extreme volatility is apparently the “new normal” for global financial markets. Frequent and dramatic market swings put a high premium on access to real-time data. For securities traders, data processing speed and analytic latency can be the difference between getting burned and getting rich.
The challenge for trading and financial services companies is not only collecting real-time market data, but also making decisions based on that data in real time. Legacy SQL database technology lacked the speed, scale, and concurrency to ingest market data and execute analytical queries simultaneously. This forced companies into buying and/or building complex and specialized systems for analytics. However, as commercially-available database technology has caught up with the speed of financial markets, it is possible to use SQL to build sophisticated applications and analyze real-time financial data.
Simple Quote Generator
While we tend to talk about stocks as having a single, definitive price per share, the reality of securities trading is more complex. Buyers and sellers see different prices, known as the bid and ask price, respectively. Public bid and ask prices are actually the best available prices chosen among all bids placed by buyers and all asks offered by sellers.
Beyond getting the best available deal, trading firms have other incentives for tracking best bid and ask prices. For one, the Securities and Exchange Commission requires that brokers always give their customers the best available price, known as the National Best Bid and Offer (NBBO).
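To make bid, ask, and spread concrete, here is a tiny sketch with made-up numbers: the public quote on each side of the book is simply the best resting price.

```python
# Hypothetical resting orders; all prices are made up for illustration.
bids = [99.95, 99.97, 99.90]     # prices buyers are willing to pay
asks = [100.05, 100.02, 100.10]  # prices sellers are willing to accept

best_bid, best_ask = max(bids), min(asks)  # best bid 99.97, best ask 100.02
spread = round(best_ask - best_bid, 2)     # 0.05 bid/ask spread
```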
The script gen.py simulates a distribution of bid and ask quotes for fluctuating stock prices. The model for the generator is very simple and easily modified. The script treats each stock as a random walk. Every iteration begins at a base value, uses a probability distribution function to generate a spread of bids and asks, then randomly fluctuates the stock value. The entire generator is around 100 lines of Python.
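Below is a minimal sketch of that loop, not the actual gen.py; the symbols, starting prices, and distribution parameters are illustrative assumptions.

```python
import random

# Hypothetical symbols and starting prices (the real gen.py defines its own).
prices = {"AAPL": 110.0, "MSFT": 45.0, "ORCL": 38.0}

def quotes(symbol, base, n=10, spread=0.05):
    """Yield n (symbol, bid, ask) tuples scattered around the base price."""
    for _ in range(n):
        half = abs(random.gauss(spread, spread / 4))  # half-width of this quote's spread
        yield (symbol, round(base - half, 2), round(base + half, 2))

for tick in range(1000):                   # each iteration is one simulated tick
    for symbol in prices:
        for quote in quotes(symbol, prices[symbol]):
            print(quote)                   # the real script inserts rows into SingleStore
        prices[symbol] += random.gauss(0, 0.10)  # random-walk step for the base price
```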
Python interpretation is the bottleneck when running on even a small SingleStore cluster. Even so, the generator achieves high enough throughput for a realistic simulation. As performance bottlenecks go, “the database is faster than the data generator” isn’t such a bad problem.
Read Post

Product
Understanding SingleStore in 5 Easy Questions
The funny thing about SingleStore is that it is simultaneously familiar and leading edge. On one hand, it is a relational database…
Read Post

Trending
Top 5 Questions Answered at Spark Summit
The SingleStore team enjoyed sponsoring and attending Spark Summit last week, where we spoke with hundreds of developers, data scientists, and architects all getting a better handle on modern data processing technologies like Spark and SingleStore. After a couple of days on the expo floor, I noticed several common questions. Below are some of the most frequent questions and answers exchanged in the SingleStore booth.
1. When should I use SingleStore?
SingleStore shines in use cases requiring analytics on a changing data set. The legacy data processing model, which creates separate silos for transactions and analytics, prevents updated data from propagating to reports and dashboards until the nightly or weekly ETL job runs. Serving analytics from a real-time operational database means reports and dashboards are accurate up to the last event, not last week.
That said, SingleStore is a relational database and you can use it to build whatever application you want! In practice, many customers choose SingleStore because it is the only solution able to handle concurrent ingest and query execution for analyzing changing datasets in real-time.
2. What does SingleStore have to do with Spark?
Short answer: you need to persist Spark data somewhere, whether in SingleStore or in another data store. Choosing SingleStore provides several benefits including:
- In-memory storage and data serving for maximum performance
- Structured database schema and indexes for fast lookups and query execution
- A connector that parallelizes data transfer and processing for high throughput
Longer answer: There are two main use cases for Spark and SingleStore:
1. Load data through Spark into SingleStore, transforming and enriching data on the fly in Spark. In this scenario, data is structured and ready to be queried as soon as it lands in SingleStore, enabling applications like dashboards and interactive analytics on real-time data. We demonstrated this “real-time pipeline” at Spark Summit, processing and analyzing real-time energy consumption data from tens of millions of devices and appliances. (A minimal sketch of this pattern follows below.)
2. Leverage the Spark DataFrame API for analytics beyond SQL using data from SingleStore. One of the best features of Spark is its expressive but concise programming interface. In addition to enabling SingleStore users to express iterative computations, it gives them access to the many libraries that run on the Spark execution engine. The SingleStore Spark connector is optimized to push computation into SingleStore to minimize data transfer and to take advantage of the SingleStore optimizer and indexing.
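As a rough sketch of the first pattern, the snippet below uses Spark’s generic JDBC reader and writer rather than the SingleStore Spark Connector itself (SingleStore speaks the MySQL wire protocol, so a MySQL-compatible JDBC driver on the Spark classpath suffices); the host, database, tables, and credentials are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("singlestore-pipeline").getOrCreate()
url = "jdbc:mysql://singlestore-host:3306/energy"  # hypothetical cluster and database

# Read raw readings out of SingleStore into a Spark DataFrame.
readings = (spark.read.format("jdbc")
            .option("url", url)
            .option("dbtable", "readings")  # hypothetical table
            .option("user", "app")
            .option("password", "secret")
            .load())

# Enrich/aggregate in Spark, then land the result back in SingleStore,
# where dashboards can query it immediately.
per_device = readings.groupBy("device_id").avg("watts")
(per_device.write.format("jdbc")
           .option("url", url)
           .option("dbtable", "device_avg")
           .option("user", "app")
           .option("password", "secret")
           .mode("append")
           .save())
```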
3. What’s the difference between SingleStore and Spark SQL?
There are several differences:
- Spark is a data processing framework, not a database, and does not natively support persistent storage. SingleStore is a database that stores data in memory and writes logs and full database snapshots to disk for durability.
- Spark treats datasets (RDDs) as immutable – there is currently no concept of an INSERT, UPDATE, or DELETE. You could express these concepts as a transformation, but the operation returns a new RDD rather than updating the dataset in place. In contrast, SingleStore is an operational database with full transactional semantics (see the sketch after this list).
- SingleStore supports updatable relational database indexes. The closest analogue in Spark is IndexedRDD, which is currently under development and provides updatable key/value indexes within a single thread.
- Spark SQL is more than a SQL server – it includes the DataFrame library, a general purpose library for manipulating structured data.
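To make the immutability point concrete, here is a hedged PySpark sketch: an RDD “update” is a transformation that returns a new dataset, while the SingleStore equivalent is an in-place, transactional statement. Table and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("immutability-demo").getOrCreate()
quotes = spark.sparkContext.parallelize([("AAPL", 110.0), ("MSFT", 45.0)])

# An RDD "update" is a transformation: `quotes` is untouched and
# `bumped` is an entirely new dataset.
bumped = quotes.map(lambda kv: (kv[0], kv[1] * 1.01))

# In SingleStore, the same change is a transactional, in-place statement
# (shown as a string only):
update_sql = "UPDATE quotes SET price = price * 1.01"
```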
4. How do SingleStore and Spark interact with one another?
The SingleStore Spark Connector is an open source tool available on the SingleStore GitHub page. Under the hood, the connector creates a mapping between SingleStore database partitions and Spark RDD partitions. It also takes advantage of both systems’ distributed architectures to load data in parallel. The connector comes with a small library that includes the SingleStoreRDD class, allowing the user to create an RDD from the result of a SQL query in SingleStore. SingleStoreRDD also comes with a method called saveToSingleStore(), which makes it easy to write data to SingleStore after processing.
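The connector itself is written in Scala, but the partition-parallel idea can be sketched in plain Python: run one reader per slice of the table concurrently, mirroring the one-RDD-partition-per-database-partition mapping. This is a conceptual sketch, not the connector’s implementation; the slice count, table, sharding predicate, and credentials are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

import pymysql  # SingleStore speaks the MySQL wire protocol

NUM_SLICES = 8  # assumed; the real connector discovers the partition layout itself

def read_slice(i):
    """Fetch one disjoint slice of the table, standing in for one RDD partition."""
    conn = pymysql.connect(host="singlestore-host", user="app",
                           password="secret", database="trades")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT symbol, bid, ask FROM quotes WHERE id %% %s = %s",
                        (NUM_SLICES, i))
            return cur.fetchall()
    finally:
        conn.close()

# One worker per slice loads data in parallel, then the slices are combined.
with ThreadPoolExecutor(max_workers=NUM_SLICES) as pool:
    rows = [row for chunk in pool.map(read_slice, range(NUM_SLICES)) for row in chunk]
```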
5. Can I have one of those cool t-shirts? (Of course!) What does the design mean?
Read Post

Trending
What We Talk About When We Talk About Real-Time
The phrase “real-time,” like love, means different things to different people.

At its most basic, the term implies near simultaneity. However, the amount of time that constitutes the “real-time window” differs across industries, professions, and even organizations. Definitions vary, and the term is so often (ab)used by marketers and analysts that some dismiss “real-time” as a meaningless buzzword.

However, there is an important distinction between “real-time” and “what we have now but faster.” A real-time system is not just faster, but fast enough to cross a performance threshold such that your business can reproducibly gain net new value.

This abstract definition is likely too general to assuage the real-time absolutists. However, there is no way to select a single numerical definition of “real-time” that works for all use cases. Rather, it’s better to talk about “real-time” as a heuristic and allow stakeholders to establish conventions tailored to their own idiosyncratic business problems.

Instead of claiming real-time means X seconds, this article will describe two classes of real-time applications and their performance requirements.

Machines Acting in Real-Time

One class of real-time applications is where machines programmatically make data-driven decisions. The ability to automate data-driven decisions is especially valuable for applications where the volumes of data or demanding service level agreements (SLAs) make it impossible for the decision to hinge on human input.

Example: Real-Time Bidding

Take the example of digital advertising, where real-time user targeting and real-time bidding on web traffic have revolutionized the industry. Selecting a display ad or choosing whether to buy traffic based on the viewer’s demographic and browsing information can boost click-through and conversion rates. Clearly, the process of choosing ads and deciding whether to buy traffic must be done programmatically – the volume of traffic on a busy site is too large, and the decisions must be made too quickly, for it to be done by humans.

For this application, “real-time” means roughly “before the web page loads in the browser window.” This brief lag period is essentially free computation time while the viewer waits a fraction of a second for the page to load.

This definition of real-time may not be numerically absolute, but it is well-defined. While businesses implementing real-time advertising platforms will often impose particular SLAs (“this database call must return in x milliseconds”), these time values are just heuristics representing an acceptable execution time. In practice, there may not be a hard and fast cutoff time beyond which it “doesn’t work.” The business may determine that clicks tail off at some rate as page load time lengthens, and that shrinking average load time causes an increase in clickthrough rate.

Example load times and clickthrough rates:

Time to load (s) | Clickthrough rate (%)
.2 | 3
.4 | 2
.6 | 1
.8 | .5
1.0 | .1

This real-time window is not a discrete interval that guarantees uniform outcomes – rather, it is defined probabilistically. Every time a user views a web page with a display ad, we know they will click on an ad with some probability (i.e., the clickthrough rate). If the page or display ad loads slowly, the viewer is more likely to overlook the ad, or navigate to a different page entirely, decreasing the average clickthrough rate. If the page and ad load quickly, the viewer is more likely to click on the ad, increasing the average clickthrough rate.

While this definition of real-time allows a range of response times, in practice the range tails off quickly. For instance, the clickthrough rate at 2 seconds of load time is likely near 0%. This is what I mean when I say a real-time application is one that is “fast enough” to capture some untapped value. The “real-time” approach of dynamically choosing display ads or bidding on traffic based on user profile information is fundamentally different from the legacy approach of statically serving ads regardless of viewer profile. However, real-time digital advertising is only worth implementing if it can be done fast enough to lift intended user behavior.

There are many applications for machines programmatically making decisions in real time beyond digital advertising: fraud detection, geo-fencing, and load-balancing a datacenter or CDN, to name a few.

Humans Acting on Real-Time Data

The other class of real-time applications is where humans respond to events and make data-driven decisions in real time.

Despite strides in artificial intelligence and predictive analytics, many business problems still require a human touch. Often, solutions require synthesizing information about a complex system or responding to anomalous events. While these problems require the critical thinking of a human, they are still data-driven. Providing better information sooner lets humans reach a solution faster.

Example: Data Center Management

A good example of this type of problem is managing complex systems like data centers. Some of the management can be automated, but ultimately humans need to respond to unexpected failure scenarios. For online retailers and service providers, uptime directly correlates with revenue.

With or without a real-time monitoring system in place, data center administrators can access live server data through remote access and systems monitoring utilities. But standard systems monitoring tools can only provide so much information. The challenge lies in collecting, summarizing, and understanding the flood of machine-generated data flowing from hundreds or thousands of servers. Doing this in real time has some demanding requirements:

- Granular logging of network traffic, memory usage, and other important system metrics
- Interactive query access to both recent and historical log and performance data, so administrators can spot anomalies and act on them
- The ability to generate statistical reports on recent machine data without tying up the database and blocking new data from being written

The third requirement is arguably the hardest, and the one on which the definition of real-time hinges. It entails processing and recording all machine data (an operational or OLTP workload) while aggregating the data into useful performance statistics (a reporting or OLAP workload). The reporting queries must execute quickly without blocking the inflow of new data.

Once again, there is no hard and fast rule for what constitutes a real-time window. It could be a second or a few seconds. Rather, the distinguishing feature of a real-time monitoring system is the ability to converge live data with historical data, and to interactively analyze and report on them together. The technical challenge is not simply collecting data, but how quickly you can extract actionable information from it.

There are many applications for real-time monitoring beyond data center administration. It can be applied to understand and optimize complex dynamic systems such as an airline or shipping network. It can also be used for financial applications like position tracking and risk management.

Moving to Real-Time Data Pipelines

While the specific numerical values associated with “real-time” may vary between organizations, many enterprises are deploying similar data processing architectures to power data-driven applications. In particular, enterprises are replacing legacy architectures that separate operational data processing from analytical data processing with real-time data pipelines that can ingest, serve, and query data simultaneously. SingleStore forms the core of many such pipelines, often used in conjunction with Apache Kafka and Spark Streaming for distributed, fault-tolerant, high-throughput data processing.
Read Post

Product
Harnessing the Enterprise Capabilities of Spark
As more developers and data scientists try Apache Spark, they ask questions about persistence, transactions and mutable data, and how to deploy statistical models in production. To address some of these questions, our CEO Eric Frenkiel recently wrote an article for Data Informed explaining key use cases integrating SingleStore and Spark together to drive concrete business value.
The article explains how you can combine SingleStore and Spark for applications like stream processing, advanced analytics, and feeding the results of analytics back into operational systems to increase efficiency and revenue. As distributed systems with speedy in-memory processing, SingleStore and Spark naturally complement one another and form the backbone of a flexible, versatile real-time data pipeline.
Read the full article here.
Get The SingleStore Spark Connector Guide
The 79 page guide covers how to design, build, and deploy Spark applications using the SingleStore Spark Connector. Inside, you will find code samples to help you get started and performance recommendations for your production-ready Apache Spark and SingleStore implementations.
Download Here
Read Post

Engineering
Boost Conversions with Overlap Ad Targeting
Digital advertising is a numbers game played out over billions of interactions. Advertisers and publishers build predictive models for buying and selling traffic, then apply those models over and over again. Even small changes to a model, changes that alter conversion rates by fractions of a percent, can have a profound impact on revenue over the course of a billion transactions.
Serving targeted ads requires a database of users segmented by interests and demographic information. Granular segmentation allows for more effective targeting. For example, you can choose more relevant ads if you have separate lists of users who like rock and roll, jazz, and classical music than if you just have one generic list of music fans.
Knowing the overlap between multiple user segments opens up new opportunities for targeting. For example, knowing that a user is both a fan of classical music and lives in the San Francisco Bay Area allows you to display an ad for tickets to the San Francisco Symphony. This ad will not be relevant to the vast majority of your audience, but may convert at a high rate for this particular “composite” segment. Similarly, you can offer LA Philharmonic tickets to classical fans in Southern California, Outside Lands tickets to rock and roll fans in the Bay Area, and so on.
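As a toy illustration of composite segments (the user IDs and segment contents are made up), set intersection captures the idea; at production scale these would be indexed database queries rather than in-memory sets.

```python
# Hypothetical user segments keyed by interest or geography.
segments = {
    "classical": {101, 102, 103, 104},
    "rock":      {102, 105, 106},
    "bay_area":  {103, 104, 106},
    "socal":     {101, 105},
}

# Composite segments are intersections of base segments.
sf_symphony   = segments["classical"] & segments["bay_area"]  # {103, 104}
la_phil       = segments["classical"] & segments["socal"]     # {101}
outside_lands = segments["rock"] & segments["bay_area"]       # {106}
```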
Read Post

Company
Chris Fry Joins SingleStore Advisory Board
We are excited to announce that web scaling expert and technology growth executive Chris Fry has joined the SingleStore advisory board. Chris will provide technical and organizational expertise to SingleStore as we expand the capabilities and market adoption of our distributed in-memory database.
Chris has broad experience in scaling both technology infrastructure and engineering organizations through high growth phases. Most recently he served as Senior Vice President of Engineering at Twitter where he killed the “Fail Whale” and created a more reliable service, led the rapid growth of the engineering team, drove the strategy for storage and compute, and controlled the technology cost structure for the company’s IPO. Prior to Twitter, Fry was the Vice President of Software Development at Salesforce.com covering all areas including applications, platform, core, and Chatter.
“When people work well together, their output is unimaginably better.”
— Chris Fry, First Round Capital CTO Summit
Chris brings a wealth of knowledge about scaling both software and businesses, and has developed a management philosophy for promoting stability in an engineering organization without sacrificing speed or the ability to innovate. We are looking forward to working with him!
You can read the full news release here.
Read Post

Engineering
SingleStore and Cisco Work Together to Make Real-Time Performance on Hadoop a Reality
While Hadoop is great for storing large volumes of data, it’s too slow for building real-time applications. However, our recent collaboration with Cisco provides a solution for Hadoop users who want a better way of processing real-time data. Using Cisco’s Application Centric Infrastructure including APIC and Nexus switch technology, we’ve been able to demonstrate exceptional throughput on concurrent SingleStore and Hadoop 2.0 workloads.
Here’s How It Works
Cisco’s new networking technology automatically prioritizes smaller packet streams generated by real-time workloads over the larger packet streams typically generated by Hadoop. This enables impressive throughput on clusters running simultaneous SingleStore and Hadoop workloads.
At the Strata + Hadoop conference last week in New York, Cisco demonstrated the solution on an 80 node cluster running both SingleStore and Hadoop. Without additional network traffic, the cluster can serve 2.4 million reads per second from SingleStore’s in-memory database. Without packet-prioritization, the database’s performance drops to under 600 thousand reads per second when a simulated Hadoop workload is added to saturate the cluster’s network. With packet-prioritization, the performance recovers to 1.4 million reads per second, more than doubling the throughput.
Why Does It Matter?
This advance provides the ability to collocate SingleStore, for real-time, mission critical data ingest and analysis, with Hadoop workloads that are less time-sensitive and executed as large batch jobs on historical data.
By combining Hadoop’s storage infrastructure with SingleStore’s real-time data processing ability, businesses get the best of both worlds: real-time analytics with Hadoop scale workloads. As an added bonus, the solution allows businesses to save on hardware costs by running SingleStore and Hadoop together on the same cluster.
If you want to learn more, contact a SingleStore representative at sales@singlestore.com or at (855) 463-7660.
Read Post

Trending
Get in the DeLorean! It’s Time for the Database News Roundup.
Stop it database industry, we’re blushing. Imitation is the sincerest form of flattery and, as the spring conference season gives way to summer vacations, we’ve noticed a flood of announcements from database vendors doing their best SingleStore impressions. Here are a few stories that caught our attention.
Oracle and SQL Server Go In-Memory
The new Oracle 12c and SQL Server 2014 both feature in-memory storage engines on top of their existing disk-based storage. Seek latency and disk contention are dirty little (open) secrets among legacy database vendors, so the move makes sense as the granddaddies of the industry try to update their technology for real-time applications. Oracle even went so far as to title their announcement webcast, “The Future of the Database Begins Soon.”
SingleStore agrees that the Future of Databases is in-memory. It’s kind of our thing. In fact, we’ve had an in-memory database on the market for a while, so I guess the Future is now, or the recent past…we tore a hole in the space-time continuum! Postgres is now also Pregres and cloud computing means using a laptop from your hoverboard.
Anyway, it’s worth noting that these recently announced in-memory storage engines are built on top of legacy technology. Oracle 12c and SQL Server were both originally designed to run on a single machine, not in a distributed environment. You can shard your database, but you’re going to end up doing extra computations client-side and find severe performance degradation when you scale beyond a few nodes. Conversely, SingleStore was originally designed as a distributed database. You get linear performance improvement as you add nodes, sharding is automatic, and the distributed query optimizer makes use of all system resources. This is a Future Oracle and Microsoft have yet to realize.
Spark 1.0 Includes Spark SQL Alpha
If Oracle and Microsoft are the granddaddies of the database industry, then Apache Spark is kind of like a bright but moody teenager. While there has been some confusion recently as to whether Spark is in fact a “speedy Swiss Army Knife1,” the point remains that Spark is an interesting technology and will probably see wider adoption in the future. If you’ve been paying attention, you probably know that the biz is a-buzz about Spark.
Read Post