Author

Carlos Bueno
Product Manager

Engineering
The Lambda Architecture Isn’t
The surest sign you have invented something worthwhile is when several other people invent it too. That means the creative pressure that gave birth to the idea is more general than your particular situation.
Read Post

Product
Why SingleStore Placed a Bet on SQL
In the technology industry, when a product or innovation endures for decades, it is usually here to stay. SQL is a great example of this – it has been around for over 30 years and is not going away anytime soon. When Eric Frenkiel and Nikita Shamgunov founded SingleStore in 2011, they were confident in choosing the SQL relational model as the foundation for their database. But the database industry of that era was clamoring for NoSQL, lauding it as the next great innovation, mostly on the themes of scalability and flexibility. When SingleStore graduated from Y Combinator, a prominent tech incubator, later that same year, it was the only distributed SQL database in a sea of non-SQL offerings.
SQL has since proven its ability to scale and meet today’s needs. Business analysts seek easy interfaces and analytics for the problems they are trying to solve. Customers want SQL, and some, like Dan McCaffrey, VP of Analytics at Teespring, happily cite it as a reason for choosing SingleStore. As Dan puts it: “What I really liked about SingleStore was the ANSI SQL support for dynamic querying needs at scale, in a reliable, robust, easy-to-use database.”
Now, with the reconquista of SQL, we are seeing two funny things happening in the market.
One, companies that monetize the Hadoop Distributed File System are adding SQL layers on top of the Hadoop platform. Two, NoSQL databases are incorporating SQL. NoSQL databases are essentially key-value stores, and adding SQL gives them the ability to do some analytics. However, a SQL layer bolted on top is no substitute for the rich, advanced SQL built into the SingleStore database from the start. SQL as a layer is just a band-aid solution.
The Gartner Magic Quadrant for Operational Database Management Systems
The latest Gartner Magic Quadrant for Operational Database Management Systems confirms something we have been championing for a while:
“By 2017, all leading operational DBMSs will offer multiple data models, relational and NoSQL, in a single DBMS platform… by 2017, the ‘NoSQL’ label will cease to distinguish DBMSs, which will result in it falling out of use.”
For years, SingleStore has supported both a fully-relational SQL model and a “NoSQL” model, together in the same cluster of machines. This was a bet made by our original engineering team: they understood the powerful appeal of SQL to business users, but also knew the value of the “NoSQL” model at vast scale. For that reason SingleStore is multi-model, and we believe the databases of the future will need to support multiple data models to survive.
Our co-founders were confident back in 2011, and we remain confident with validation from the market, research firms like Gartner, and most importantly from our customers, that SQL is the path forward. We will continue to hone the SQL aspects of our database and champion the lingua franca of the database world.
Read Post

Product
Digital Ocean Tutorial Gets You Up and Running in Minutes
As fun as it is to squirrel around inside the guts of some new technology, it’s sometimes nice to follow a recipe and end up with something that Just Works. For years, Digital Ocean, an up-and-coming cloud provider, has been producing quality tutorials on how to set up cool software on their virtual machines. Today Ian Hansen published an in-depth tutorial on setting up a three-node SingleStore cluster. Check it out here.
Go to the Digital Ocean tutorial and learn how to install SingleStore in minutes
Once the cluster is running, Ian walks through our DB Speed Test. He then dives into interacting with SingleStore using the stock MySQL client and handling structured and unstructured data with our JSON datatype. The next tutorials in the series will deal with sharding strategies, replication, and security.
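For a flavor of what the JSON datatype looks like in practice, here is a minimal sketch mixing fixed relational columns with a JSON document column; the table, columns, and values are invented for illustration, not taken from Ian’s tutorial:

```sql
-- Hypothetical events table: fixed columns plus a JSON document column.
CREATE TABLE events (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  created DATETIME,
  properties JSON
);

INSERT INTO events (created, properties)
VALUES (NOW(), '{"page": "/pricing", "referrer": "google"}');

-- Filter on a field inside the JSON document.
SELECT id, created
FROM events
WHERE JSON_EXTRACT_STRING(properties, 'page') = '/pricing';
```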
We’re also lucky to have Ian here at Strata / Hadoop World in NYC to give a talk called “Big Data for Small Teams”, about how Digital Ocean uses SingleStore to unify and analyze their clickstream data with a minimum of fuss.
Read Post

Product
SingleStore Community Edition Available on AWS and Azure Marketplaces
Good news, everyone! Today we’re releasing SingleStore Community Edition on the Amazon AWS and Microsoft Azure Marketplaces. Many of our customers run SingleStore in the cloud, and many run their entire infrastructure there. An even larger number try SingleStore in the cloud first before pulling it into their production systems.
A great example is VCare, currently using SingleStore deployed on AWS cloud instances to power its online, real-time charging platform. VCare supports Mobile Virtual Network Operators (MVNOs), whose customers pay for a certain number of minutes. With its end-to-end software solution, VCare can verify that there is sufficient balance available for users to make phone calls or send text messages. The entire process takes place in milliseconds, supported by the cloud, to ensure there is no lapse in service.
Now it is easier than ever to get started. With just a few clicks you can launch a “cluster in a box” on a single virtual machine, and run our DB Speed Test within minutes. The DB Speed Test is a 30-second performance benchmark that comes with SingleStore Ops, and it frequently pushes over 1 million inserts per second on a single virtual machine.
Read Post

Engineering
How to Make a Believable Benchmark
A benchmark asks a specific question, makes a guess about the expected result, and confirms or denies it with experiment. If it compares anything, it compares like to like and discloses enough details so that others can plausibly repeat it. If your benchmark does not do all of these things, it is not a benchmark.
Today’s question comes from one of our engineers, who was talking to a customer about new features in SingleStoreDB Self-Managed 4. We added support for SSL network encryption between clients and the cluster, and also between nodes in the cluster. The customer wanted to know how performance would be impacted.
Read Post

Engineering
Finding and Solving Bottlenecks in Your System
Read data in, write data out. In their purest form, this is what computers accomplish. Building a high-performance data processing system requires accounting for how much data must move, to where, and what computation must run on it. The trick is to establish the size and heft of your data, and focus on its flow. Identifying and correcting bottlenecks in the flow will help you build a low latency system that scales over time.

Characterizing your system

Before taking action, characterize your system using the following 8 factors:

Working set size: The set of data a system needs to address during normal operation. A complex system will have many distinct working sets, but one or two of them usually dominate.

Average transaction size: The working set of a single transaction performed by the system.

Request rate: Expected throughput. The combination of throughput and transaction size governs most of the total data flow of the system.

Update rate: A measure of how often data is added, deleted, and edited.

Consistency: The time required for an update to spread through the system.

Locality: The portion of a working set a request needs access to.

Computation: The amount of math needed to run on the data.

Latency: The expected time for transactions to return a success or failure.

Download the Capacity Planning Cheat Sheet

Identifying bottlenecks

After pinpointing these characteristics, it should be possible to determine the dominant operation responsible for data congestion. The answer might be obvious, but identifying the true bottleneck gives you a core factor to focus on.

The pizzeria example

Let’s say you own a pizza shop and want to make more money. If there are long lines to order, you can double the number of registers. If the pizzas arrive late, you can work on developing a better rhythm. You might even try raising the oven temperature a bit. But fundamentally, a pizza shop’s bottleneck is the size of its oven. Even if you get everything else right, you won’t be able to move more pizzas per day without expanding your oven’s capacity or buying a second one.

If you can’t clearly see a fundamental bottleneck, change a constraint and see what shifts in response. What happens if you have to reduce the latency requirement by 10x? Halve the number of computers? What tricks could you get away with if you relaxed the constraint on consistency? It’s common to take the initial constraints as true and unmoving, but they rarely are. Creativity in the questions has more leverage than creativity in the answers.

If you’re looking to build a well-designed computing system, I contributed an in-depth article on InfoQ that provides use cases and real-world examples.
Read Post

Engineering
Geospatial Intelligence Coming to SingleStore
This week at the Esri Developers Summit in Palm Springs, our friends at Esri are previewing upcoming features for the next release of SingleStore, using a huge real-world geospatial dataset.
Esri develops geographic information systems (GIS) that function as an integral component in nearly every type of organization. In a recent report by the ARC Advisory Group, the Geographic Information System Global Market Research Study, the authors stated, “Esri is, without a doubt, the dominant player in the GIS market.”
Everything happens somewhere. But, traditionally, spatial data has been locked away in specialized software that either lacked general database features, or didn’t scale out. With SingleStore we are making geospatial data a first-class citizen: just as easy to use, at scale, at great speed and high throughput, as any other kind of data.
The demonstration uses the “Taxistats” dataset: a compilation of 170 million real-world NYC taxi rides. It includes GPS coordinates of the pickup and dropoff, distance, and travel time. SingleStore is coupled with the new version of Esri’s ArcGIS Server, which has a new feature to translate ArcGIS queries into external database queries. From there we generate heatmaps from the raw data in sub-second time.
Heatmaps are a great way to visualize aggregate geospatial data. The X and Y are the longitude and latitude of “cells” or “pixels” on the map, and the color shows the intensity of the values. From there you can explore the dataset across any number of dimensions: zoom in on an area, filter by time, length of ride, and more.
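As a sketch of the kind of query behind such a heatmap (the table and column names here are assumptions, not the actual Taxistats schema), you can bucket coordinates into grid cells and count rides per cell:

```sql
-- Bucket pickups into roughly 100m grid cells and count rides per cell.
SELECT
  ROUND(pickup_lon, 3) AS cell_x,
  ROUND(pickup_lat, 3) AS cell_y,
  COUNT(*) AS intensity
FROM taxi_rides
WHERE pickup_time >= '2013-01-01'
  AND pickup_time <  '2013-02-01'
GROUP BY cell_x, cell_y;
```

The intensity column then drives the color of each pixel; zooming or filtering just tightens the WHERE clause.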
Read Post

Engineering
Cache is the new RAM
Read Post

Engineering
How SingleStore Distributes Data
The world is moving toward highly-parallel, distributed systems up and down the stack. It started with the app servers on the front lines. Now it’s happening at the data layer. This is generally a good thing, but these kinds of systems can sometimes feel like a foreign country. In this article we’ll talk about how SingleStore spreads the work around a cluster of servers without sacrificing ease of use.
There are two kinds of machines in a SingleStore cluster: aggregators and leaves. The same software runs on both kinds of machines, and they all communicate using the same MySQL-compliant protocol. The only difference is in what role each one is told to play.
Aggregators aggregate
An aggregator is like a SQL proxy sitting in front of your cluster. Under normal circumstances, you only ever talk to the cluster through an aggregator. You connect to it with your favorite SQL client and insert rows, run selects, and manipulate data any way you want. From your perspective you are simply talking to “the” database server. The work of coordinating the parallel operations and combining everything into a single set of results is what an aggregator does. There could be any number of leaves acting in concert to answer your query, or only one. To you it’s completely transparent.
Leaves… are where you leave data
A “leaf” server is where your data gets stored. There’s nothing very complicated about it. Each leaf is an instance of SingleStore that acts independently. In fact, a leaf is barely aware that it’s part of a larger cluster of machines. All it knows is that some user named “root” logs into it from time to time and tells it to store or retrieve some data, and that it sometimes has to replicate data to and from its buddies. If you were to connect to a leaf with a SQL client and poke around, you’d see something like this:
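```
memsql> SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| memsql_demo_0      |
| memsql_demo_1      |
| memsql_demo_2      |
| ...                |
| memsql_demo_7      |
+--------------------+
```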
Each of those memsql_demo_N databases is a partition, created and managed by the aggregator. When you query “select count(1) from lineitem” from an aggregator, in the background it gets translated into parallel queries across the cluster.
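Concretely, the fan-out looks something like this; the rewritten query text below is an illustration of the shape, not the literal internal SQL:

```sql
-- What you run on the aggregator:
SELECT COUNT(1) FROM lineitem;

-- Roughly what runs on the leaves, one query per partition:
SELECT COUNT(1) FROM memsql_demo_0.lineitem;
SELECT COUNT(1) FROM memsql_demo_1.lineitem;
-- ...and so on through memsql_demo_7; the aggregator sums the
-- partial counts and returns a single number to you.
```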
So where does the data go?
As much as the aggregators try to maintain the illusion of a single consistent database, it helps to know a bit about how things work under the hood. A distributed system is fundamentally different from a single-box database. There are a few ways you can arrange your cluster, each with its tradeoffs.
The simplest SingleStore cluster has one aggregator and one leaf. Let’s ignore the aggregator and focus on the leaf. When you create a database, by default it’s split into 8 partitions. Think of a donkey loaded up with a bunch of packages.
This works just fine. Modern multi-core servers, running an in-memory database like SingleStore, can handle an enormous number of concurrent reads and writes very quickly. Because the data is partitioned even on a single leaf it’s easy to achieve parallelism and use all those expensive CPU cores to their fullest. But there’s only so much load a single donkey can carry. To expand the size of your cluster, you add another one and rebalance.
This also works fine. As data gets inserted through an aggregator, the aggregator will make sure that both donkeys get equal amounts of data. In non-donkey terms, this configuration is like RAID-0. There is only one copy of your data, split up and distributed over multiple devices. As with RAID, you should use donkeys of similar sizes to achieve good balance. This configuration can take you pretty far.
However, there is a potential problem. The more donkeys you add, the more likely it is that any one donkey will stumble or get sick, making part of the dataset unavailable. Since there is only one copy this increases the chances of the entire cluster failing. For best results you want two copies of each partition.
This configuration operates more like RAID 1+0. You can expand your cluster’s capacity more or less infinitely by adding more pairs of donkeys. In the case of a failure, the backup donkey automatically takes up the slack until full capacity is restored.
This doesn’t mean that the backup donkeys are doing nothing. SingleStore high-availability is more fine-grained than that. Remember that each database is split into lots of partitions which are spread across the leaves. The aggregator makes sure each leaf machine gets some of the active partitions and some of the backups, so you can use all of your cluster’s CPUs during normal operation. In other words, your data is striped.
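In command terms, the partition count is set when a database is created, and a rebalance spreads partitions across new leaves. A sketch, assuming a database named demo; the exact default partition count can vary by version:

```sql
-- Split the database into 16 partitions instead of the default 8.
CREATE DATABASE demo PARTITIONS 16;

-- After adding leaves to the cluster, spread the partitions (and
-- their backup copies) evenly across the new hardware.
REBALANCE PARTITIONS ON demo;
```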
Building a SQL-compliant database that can scale over thousands of machines is a bigger deal than it might sound. Most of the hard scaling and performance problems I’ve observed with traditional single-box SQL databases come down to two causes: I/O bottlenecks on the disk, and the fact that the database runs on a single box. There’s only so much RAM and CPU power you can cram into a single machine. If the success of Google, Facebook, and Amazon has taught us anything, it’s the importance of scaling horizontally over lots of commodity computers.
But it’s been too easy to throw the baby out with the bathwater, to give up a powerful query language in exchange for scale. A distributed “NoSQL” system like MongoDB is a relatively easier programming problem than the one we’ve tackled. MongoDB, for instance, doesn’t yet support very high rates of concurrent reads and writes. For years Mongo ignored the hard bits of supporting relational set operations. Even now it has only partial support for joins and aggregations, and of course you have to learn a new query language to use it… a language that seems to be edging closer and closer to SQL itself. Our approach is to cut to the chase and give you a highly-concurrent, distributed SQL database, built out of parts that are straightforward to understand.
A couple of details
There are two kinds of tables in SingleStore: reference tables and distributed tables. Distributed tables are laid out as described above. Reference tables are intended for smaller data that tends to get joined frequently along many dimensions. To make those joins more efficient, reference tables are replicated on every leaf and aggregator. In data warehouse terms, the tiny dimension tables are reference tables, and the gigantic fact tables are distributed.
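A minimal sketch of the two kinds of tables; the names are invented for illustration:

```sql
-- A small dimension table, replicated to every leaf and aggregator.
CREATE REFERENCE TABLE regions (
  region_id INT PRIMARY KEY,
  name VARCHAR(64)
);

-- A gigantic fact table, split into partitions across the leaves.
CREATE TABLE page_views (
  view_id BIGINT,
  region_id INT,
  viewed_at DATETIME,
  SHARD KEY (view_id)
);
```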
There are two kinds of aggregators: the master and the children. There is no real difference between the software running on them. Master is just a role assigned to one (and only one) aggregator at a time. The master is responsible for DDL commands like adding or dropping tables, rearranging partitions, and making sure all the other aggregators have consistent metadata about the cluster. All the normal stuff like inserts and deletes can be handled by any aggregator, even when the master is down. In the event of a failure, you can easily promote any aggregator to be the master, and the cluster marches on.
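The promotion itself is a single statement, run against the aggregator you want to take over; a sketch:

```sql
-- Run on the child aggregator being promoted to master.
AGGREGATOR SET AS MASTER;
```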
Read Post

Engineering
The Size & Shape of Your Tables
If you ever catch yourself doing mental arithmetic, or writing complex queries while trying to debug a capacity or performance problem, it’s a sign that your database is missing a feature. The information you need may exist, but it’s scattered.
One of the most frequent feature requests from our users was more detailed statistics about memory use. In the current version of SingleStore, tables are stored in memory, so memory management and storage capacity planning are essentially the same activity. In SingleStore version 2.6 there is a new table in information_schema, table_statistics, that exposes those statistics.
Each row describes a “partition” of a distributed table, or a copy of a reference table. You can roll this table up in lots of ways to generate things like the total size of a table in the cluster.
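For example, a sketch of one such rollup; the column names follow the table_statistics schema as best we recall, so treat them as assumptions:

```sql
-- Total rows and memory per table, summed across all partitions.
SELECT database_name, table_name,
       SUM(`rows`) AS total_rows,
       SUM(memory_use) AS total_bytes
FROM information_schema.table_statistics
GROUP BY database_name, table_name
ORDER BY total_bytes DESC;
```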
One thing to watch out for in any distributed database is “skew.” The rows in a database can be distributed around the machines in the cluster randomly, which ensures that every node has more or less the same amount of data. But they can also be sharded according to the values in one or more columns. For example, a “hits” table in a web traffic log could be sharded by the IP address of the client. This has many nice properties, but can also cause the data to be spread unevenly. An ISP might employ a proxy, for example, so a large amount of traffic might appear to come from a single IP address, and send a lot of rows to the same partition.
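In DDL terms, that sharding choice is a one-line declaration; a sketch with an invented schema:

```sql
-- Shard the hits table by client IP: all rows for one IP land on
-- the same partition, which is great for per-client queries but
-- risks skew when one IP (say, an ISP proxy) dominates the traffic.
CREATE TABLE hits (
  ip VARCHAR(45),
  url VARCHAR(2048),
  hit_at DATETIME,
  SHARD KEY (ip)
);
```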
To measure skew you count up the number of rows in and the space consumed by each partition and see how much they vary from the average. A handy statistical function for this is the “relative standard deviation”, expressed as stddev()/avg(). As a general rule of thumb, anything below 0.05 is an acceptable amount of skew.
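A sketch of that check against the same table_statistics view (column names assumed as above):

```sql
-- Relative standard deviation of per-partition row counts, per table.
-- Values above roughly 0.05 suggest a skewed shard key.
SELECT database_name, table_name,
       STDDEV(`rows`) / AVG(`rows`) AS row_skew
FROM information_schema.table_statistics
GROUP BY database_name, table_name
ORDER BY row_skew DESC;
```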
It’s nice to have the data available, but typing these queries from memory is tiresome. SingleStore Ops is a performance and monitoring application built on top of the database, and runs inside the browser. Building visualizations that combined table_statistics with the table schema was a joy.
How new features get made
There’s always too much to do. The set of features you could write is nearly infinite, so having a way to choose the right ones to do next is crucial. These new table stats features developed side-by-side in the interplay between our “engine” team, which builds the database, and the “ops” team, which builds monitoring and performance apps on top. Ops folks generally ask for better data, and Engine folks generally ask for better visualizations. Having a demanding customer on the inside shortens the feedback loop immensely and makes for better software.
For example, the ops team wanted more detailed information about the size of values in each column in a table. They wrote a query that scanned every field in every row in every table in every database in the cluster, and set it to run every 60 seconds. This accomplished two things simultaneously. It allowed the table statistics UI to be prototyped quickly using real data. It also horrified the engine team so much that they built a way to get at those statistics much more cheaply.
“The best way to get the right answer on the Internet is not to ask a question, it’s to post the wrong answer.”
— Cunningham’s Law
All of these engine features are available today with SingleStoreDB Self-Managed 2.6. The new UI features will ship with SingleStore Ops 3.0 in a few weeks. Enjoy!
Read Post

Engineering
Through a Table, Sparsely
Take a look at your database. What do you see? An alphabetic list of table names. Is that really the most important thing to know? Surely, other facts about your database are more interesting. Maybe you want to know how big the tables are.
Database table sizes, like mountain heights and city populations and book sales, tend to follow a pattern casually called a power-law distribution. A minority of tables hold the majority of data. The rest are a long tail of necessary but small things, a kind of embroidery around the edges. Below is a more meaningful list of tables, sorted by the space they consume.
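Such a list is one query away, using the table_statistics view described in “The Size & Shape of Your Tables” above (column names assumed as in that post):

```sql
-- The long tail made visible: tables sorted by the space they consume.
SELECT table_name,
       SUM(memory_use) AS total_bytes,
       SUM(`rows`) AS total_rows
FROM information_schema.table_statistics
GROUP BY table_name
ORDER BY total_bytes DESC
LIMIT 20;
```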
Read Post

Engineering
Notes on Sizing a SingleStore Cluster
One of the most valuable features a distributed system can have is predictability. When a system is predictable, you can reason about how it will respond to changes in workload. You can run small-scale experiments, and trust that the results will scale linearly. It’s much more pleasant and efficient to tailor the hardware to the task than the other way around. No one wants to pay for more compute than they need because RAM filled up first, or be unable to use the full power of their machines because of an I/O bottleneck. In this article we will walk through a few scenarios and rules of thumb for sizing a SingleStore database cluster.
“The biscuits and the syrup never come out even.” — Robert Heinlein
All reads and writes in SingleStore flow through “aggregator” servers to many “leaf” servers. Each CPU core on each leaf can be thought of as its own instance of SingleStore, talking only to aggregators. Under normal operation, each aggregator is also acting independently. As inserts are sent to a randomly chosen aggregator, their shard keys are hashed and the data is sent to the relevant leaves. Select queries for particular keys are sent only to the relevant leaves, and selects or joins across entire tables are executed in parallel across the cluster.
A shared-nothing architecture has many nice properties. The network behavior is easy to reason about because there is no cross-talk between leaves; a badly-written query can’t suddenly cause congestion on the network. You can use inexpensive in-server storage for durability instead of an expensive SAN. The bandwidth and parallelism of disks directly attached to your leaf servers scales linearly with the size of your cluster — no I/O bottlenecks from many servers fighting over the same shared drives.
Latency is also easy to reason about. Let’s say you have a small database: 120 million records spread over 4 leaves. If it takes 700 milliseconds to do a full table scan, then adding 4 more leaves will almost certainly make your queries twice as fast.
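The arithmetic behind that claim, under the linear-scaling assumption stated above:

$$
t_{\text{new}} \approx t_{\text{old}} \times \frac{n_{\text{old}}}{n_{\text{new}}} = 700\ \text{ms} \times \frac{4}{8} = 350\ \text{ms}
$$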
Read Post