In 2016, Pandora chose SingleStore to serve as the data store and foundation for solving a specific business problem.
In this post, we explore:
- The Business Problem — what were we trying to solve?
- The Previous System — a few details
- Requirements — business requirements (and a few technical ones too)
- The Contenders — what did the options look like
- About SingleStore
- How We Implemented SingleStore
- Data Strategy: Columnstore vs. Rowstore
- Hardware Details
- Into the Weeds
- Summary
The Business Problem
The first step in finding a solution is to clearly define the problem. The goal was not to find some way to solve the problem, but the best way to solve it.
What were we trying to solve? We wanted a dashboard that would allow our Ad sales team to view data about current and former ad campaigns. There were, of course, more details that described the goal, as well as a number of technical challenges along the way, but from a business perspective, the core success criterion really was simply to “build a dashboard”. Without a usable dashboard, no amount of engineering heroics would matter.
The Previous System
When this project started, we already had a system that various teams across the company used on a daily basis. As is the case with many projects, it solved specific needs that made sense at the time, but it became increasingly difficult to keep up with changing requirements and new features. Over time, its limitations and shortcomings became clearer and more impactful.
These were some of the biggest limitations in our previous system:
- There was a two-day delay before new data became available. Answers about Monday weren’t available until Wednesday.
- Answers to the hardest, most interesting queries were pre-computed. This was great for query performance, but it resulted in a fairly rigid user experience where it wasn’t easy to ask new questions or even slight variations of existing questions.
- Performance was a concern. Keeping up with each new 24-hour period of data took ~20 hours of elapsed time on a large Hadoop cluster. As data sets continued to grow, more and more processing time was required to handle one day’s worth of data.
Requirements
The requirements are listed below, in no particular order. At the start of this project, there were many internal conversations that started with, “Have you looked at X? It’s a way to store data and it’s great.” Many different “X” things came up. In some cases, we had looked at X; other times, not. By starting with a list of requirements, we could easily see which data storage options might work and which would miss the mark. It’s an Apache Incubator project? We’re not going to use it. Scaling the cluster requires the whole thing to be offline for hours or days? Not doing that. It’s a columnar data store and serves queries quickly, but you have to rebuild the columnar indexes for new data to be visible? Nope.
Deliver results via real-time queries — Instead of pre-computing answers for specific, predefined questions, we wanted a data store that would be fast enough to support real-time queries in less than one second.
New data should be available as soon as possible — We wanted new data to be available much faster than two days. Some performant data stores address query speed by pre-computing fixed answers hours in advance, so this requirement came to mean “no pre-computed answers”. We wanted to get away from the cycle of new data arriving, waiting for nightly jobs to run, and only then seeing the results in a dashboard. Real-time queries would also mean that new data could show up in the UI as soon as it arrived. This requirement narrowed the field quite a bit.
Support two years of raw click and impression data — A window of two years exceeded the longest business needs, so we started there. The initial plan was to keep a rolling window of no more than two years’ worth of data due to concerns about increased storage (for columnar data) as well as performance degradation (for both reads and writes). This requirement translated into “hundreds of billions of rows” and “terabytes of data”. As it turns out, the “two year” mark was quite conservative — we’re currently approaching four years of accumulated data (we backfilled data from before 2016) with no observed performance impact and plenty of storage remaining (thanks in large part to a steady 85–90% on-disk compression for columnar data).
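To make that scale concrete, here is a minimal sketch of what a columnstore table for raw event data of this shape might look like, written as a Python script with the DDL embedded. Everything here is hypothetical: the table name, columns, host, and credentials are invented, and the `KEY ... USING CLUSTERED COLUMNSTORE` clause reflects the columnstore syntax of MemSQL/SingleStore releases from around that time, not our actual schema.

```python
# Hypothetical sketch only; not Pandora's actual schema.
import pymysql  # any MySQL-compatible driver works

DDL = """
CREATE TABLE IF NOT EXISTS ad_impressions (
    event_date   DATE     NOT NULL,  -- sort dimension for the columnstore
    event_ts     DATETIME NOT NULL,
    campaign_id  BIGINT   NOT NULL,
    listener_id  BIGINT   NOT NULL,
    event_type   TINYINT  NOT NULL,  -- e.g. 0 = impression, 1 = click
    KEY (event_date) USING CLUSTERED COLUMNSTORE
)
"""

# A rolling retention window then reduces to a periodic delete by date.
RETENTION = "DELETE FROM ad_impressions WHERE event_date < %s"

conn = pymysql.connect(host="db.example.com", port=3306,
                       user="ads", password="secret", database="ads")
with conn.cursor() as cur:
    cur.execute(DDL)
conn.close()
```

Sorting on the date column keeps each day’s data clustered together on disk, which is part of what makes both a rolling retention window and the on-disk compression mentioned above work well.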
It should be fast — As a concrete goal, we wanted sub-second query response over hundreds of billions of rows of data. Some data stores claim to be “fast” or “X times faster than Y” (“Presto can be 100 or more times faster than Hive”), but what are the real numbers? For example: if a table has 10 billion rows, how much time will pass between issuing a query and getting a result? Many “fast” data stores produce an answer anywhere from “several seconds” to minutes (or more). Being 100 times faster than a 2-hour Hive query isn’t nearly fast enough to satisfy a user clicking around a dashboard.
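The measurement we cared about is easy to state in code: issue a realistic dashboard query and time it end to end. A sketch, again with invented table, column, and connection names:

```python
# Hypothetical sketch: time a dashboard-style aggregation end to end.
import time
import pymysql

conn = pymysql.connect(host="db.example.com", port=3306,
                       user="ads", password="secret", database="ads")

QUERY = """
SELECT campaign_id,
       COUNT(*)            AS impressions,
       SUM(event_type = 1) AS clicks
FROM ad_impressions
WHERE event_date BETWEEN %s AND %s
GROUP BY campaign_id
ORDER BY impressions DESC
LIMIT 50
"""

start = time.perf_counter()
with conn.cursor() as cur:
    cur.execute(QUERY, ("2017-01-01", "2017-03-31"))
    rows = cur.fetchall()
elapsed = time.perf_counter() - start

# The requirement: elapsed stays under one second, even when the
# table behind the query holds hundreds of billions of rows.
print(f"{len(rows)} rows in {elapsed:.3f}s")
conn.close()
```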
Product maturity — While we didn’t have hard requirements about what this meant, it was clear that some options in the mix were significantly more experimental than we were comfortable with. Why was this important? We didn’t want to rely on unproven technology to serve as the foundation for this project. We understood that it was important to choose something with real production use, ideally at a scale similar to ours, rather than jumping straight into the weeds trying things that had never been done. Some projects can afford that risk vs. reward tradeoff, as well as an open-ended timeline, but we could not. We wanted something that would work out of the gate.
Minimal barriers to internal adoption — This requirement addressed the reality of introducing new technology to the team. Some of the options looked like interesting contenders, but had a totally unique, unfamiliar query language. Some options had significant limitations in database drivers. Ideally, we would end up with something that supported standard SQL and had plenty of database drivers available.
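SingleStore clears this bar because it speaks the MySQL wire protocol, so any stock MySQL driver connects unmodified. A minimal sketch using pymysql (host, credentials, and table name are placeholders):

```python
# Minimal sketch: SingleStore is MySQL wire-protocol compatible, so an
# off-the-shelf MySQL driver needs no custom code. Names are placeholders.
import pymysql

conn = pymysql.connect(host="db.example.com", port=3306,
                       user="analyst", password="secret", database="ads")
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM ad_impressions WHERE event_date = %s",
                ("2017-06-01",))
    (total,) = cur.fetchone()
print(f"impressions: {total}")
conn.close()
```

The same code runs against MySQL itself, which is exactly the kind of low barrier to adoption this requirement was after.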
24/7 support — When (not “if”!) things go wrong, would we need to rely on our own engineers to understand problems, find solutions, and implement fixes? Hardware issues, operational problems, software bugs, incorrect configuration, human error… we knew from experience that, over time, something would go wrong. What we didn’t know was what would break, when it would happen, or how to fix it. Is there an option for immediate help?
Needs to be scalable