Making Painless Schema Changes
Engineering

Making Painless Schema Changes

The ability to change a table’s schema without downtime in production is a critical feature of any database system. In spite of this, many traditional relational databases have poor support for it. Quick and easy schema changes were a key advantage of early distributed NoSQL systems, but of course, those systems jettison relational capabilities. Though conventional wisdom may indicate otherwise, easy schema changes are possible with the relational model. At SingleStore we put careful thought and effort into making sure that ALTER TABLE operations have minimal impact on running workloads. This feature is commonly called an “online” ALTER TABLE. Most relational databases support the notion of an “online” ALTER TABLE, but every vendor has a different definition of what that means. In SingleStore we define a true online ALTER as one that:

1) Does not require doubling the disk or memory use of the table while executing (creating a second copy of the table without destroying the original is not allowed)
2) Does not lock the table or prevent querying it (read or write) for long periods of time while running (blocking queries for under a second is OK)
3) Does not use excessive system resources (CPU, disk, network) while running, no matter the size of the table or the workload running against it

SingleStore is the only distributed relational database able to achieve all three. For example, MySQL Cluster fails at (1) – it copies the table in many cases. VoltDB, Vertica, and Redshift fail at (2) – they lock the table throughout the entire ALTER operation, effectively taking down your production system or requiring tedious juggling of replicas.

Explaining how our ALTER TABLE works is best done by stepping through an example. Let’s say we want to add a column to a table as follows:

CREATE TABLE example(c1 int primary key);
ALTER TABLE example ADD COLUMN c2 VARCHAR(100) DEFAULT NULL;

Consider this diagram while we outline how ALTER runs through four phases of execution in the SingleStore rowstore.
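As a quick way to see what “online” means in practice, here is a minimal sketch — not the post’s internals — that keeps writing to the table from one thread while the ALTER runs in another. It assumes a cluster reachable over the MySQL protocol at 127.0.0.1:3306, hypothetical root credentials with an empty password, and the pymysql client:

```python
# Minimal sketch: keep inserting from one thread while ALTER runs in another,
# to observe that writes are not blocked. Connection details are placeholders.
import threading
import time

import pymysql  # SingleStore speaks the MySQL protocol, so any MySQL client works


def connect():
    return pymysql.connect(host="127.0.0.1", port=3306, user="root",
                           password="", database="test", autocommit=True)


def writer(stop):
    conn = connect()
    with conn.cursor() as cur:
        i = 0
        while not stop.is_set():
            cur.execute("INSERT INTO example VALUES (%s)", (i,))
            i += 1
    conn.close()


conn = connect()
with conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS example")
    cur.execute("CREATE TABLE example(c1 int primary key)")

stop = threading.Event()
t = threading.Thread(target=writer, args=(stop,))
t.start()
time.sleep(1)  # let some concurrent writes land first

with conn.cursor() as cur:
    started = time.time()
    cur.execute("ALTER TABLE example ADD COLUMN c2 VARCHAR(100) DEFAULT NULL")
    print("ALTER returned after %.2fs; writes kept flowing" % (time.time() - started))
    cur.execute("SELECT COUNT(*) FROM example")
    print("rows written so far:", cur.fetchone()[0])

stop.set()
t.join()
conn.close()
```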
Read Post
How to Write Compilers in Modern C++ – Meetup with Drew Paroski
Engineering

How to Write Compilers in Modern C++ – Meetup with Drew Paroski

Visit our SoMa headquarters this Wednesday, August 19th for our third official meetup, from 6pm-8pm! This is an exclusive opportunity to learn the art of building compilers from Drew Paroski. Before joining SingleStore, Drew co-created the HipHop Virtual Machine (HHVM) and Hack programming language to support Facebook’s web scale across a growing user base in the billions. Read more about Drew here: http://blog.memsql.com/creator-of-hhvm-joins-memsql/. We will have a delicious Mexican feast complete with appetizers, south of the border brews, and wine.

Compilers maximize application performance by translating a given programming language, like C++, into machine code. The ideal compiler produces very efficient machine code for popular programming languages, which means that programs written in the source language (e-commerce websites, games, social networking sites, you name it) will be able to execute 2x, 5x, 10x faster. Compilers represent a single piece of software that can speed up all kinds of applications. Drew’s expertise includes computer performance, programming with big data, and the advancement of compilers over the past 20 years. At the meetup, he will outline key considerations for building the best possible compiler, including:

- identifying your performance goals
- evaluating a full-custom approach versus alternatives
- developing measurement benchmarks
Read Post
How to Deploy SingleStore on the Mesosphere DCOS
Engineering

How to Deploy SingleStore on the Mesosphere DCOS

The Mesosphere Datacenter Operating System (DCOS) is a distributed operating system designed to span all machines in a datacenter. It provides mechanisms for deploying applications across the entire system with a few simple commands. SingleStore is a great fit for deployment on DCOS because of its distributed, memory-optimized design. For example, users can scale computation and storage capacity by simply adding nodes. SingleStore deploys across commodity hardware and cloud, giving users the flexibility to operate with existing infrastructure or build custom hardware solutions. SingleStore on DCOS can optimize and simplify your test and development projects; however, it is not a supported configuration and is not recommended for production deployments. In this blog post, we will illustrate an example of how to deploy SingleStore for your development or test environment on a cluster of DCOS-configured machines.

Deploying SingleStore on DCOS

Users can quickly get started with DCOS by deploying a cluster on Amazon AWS. Mesosphere provides a Mesosphere DCOS template specifically for this purpose, which leverages the AWS CloudFormation infrastructure. Follow the steps on docs.d2iq.com to set up DCOS on AWS. Deploying SingleStore on DCOS is simple with the DCOS command line. Once you have deployed a DCOS cluster and installed the DCOS command-line interface (check out the Mesosphere documentation for more information on this step), simply run the following command on the DCOS command line:

`$ dcos package install memsql`

At that point, if you check the DCOS web interface, you should see the SingleStore service running:
Read Post
The Resurgence of Scala for Big Data
Engineering

The Resurgence of Scala for Big Data

Big Data Scala by the Bay, Aug 16-18, is shaping up to be an engaging event, and will bring together top data engineers, data scientists, developers, and data managers who use the Scala language to build big data pipelines. At the SingleStore booth, we will showcase how enterprises can streamline this process by building their own real-time data pipelines using Apache Kafka, Apache Spark, and operational databases. Many of our customers are moving to this real-time data pipeline: a simplified Lambda Architecture that minimizes overhead while delivering remarkably fast analytics on changing datasets. Learn more here: http://bigdatascala.bythebay.io/. To provide more perspective on the intersection of Scala and in-memory databases, we sat down with Ben Campbell, our in-house Scala expert.

Q: Describe the technical underpinnings of Scala.

Scala is notable, and has achieved widespread use, largely because of the way it combines two distinct programming paradigms: object-oriented and functional. Object-oriented programming is, of course, familiar to most with C++ or Java — nearly all programmers have some familiarity with one or both of those languages. Functional programming, on the other hand, is less well-known, having historically been consigned largely to academic theory and niche applications. By combining the two approaches, Scala has been able to do what its functional predecessors have not: achieve widespread adoption by a community largely reared on the object-oriented paradigm.

There’s an interesting analogy between Scala and C++, which was the breakout object-oriented language. C++ was not the first object-oriented language, nor was it a pure object-oriented language. However, C++ became widely adopted because it bridged the gap between C, a non-object-oriented language in widespread use at the time, and the object-oriented approach. Scala has done something similar: built on Java — it makes use of Java libraries and compiles to Java bytecode that runs on the Java virtual machine — it has been relatively easy to adopt for a generation raised on the object-oriented paradigm. But Scala can also be used in a highly functional manner. So programmers coming from a Java background tend to increasingly embrace Scala’s functional features over time.

Q: What is the functional programming paradigm, and how does it differ from alternatives?

Functional programming treats computation as a problem of evaluating mathematical functions. Object-oriented programming, on the other hand, treats computation as a series of changes in state. Functional programming avoids such state changes, and hence there is no requirement for mutable data. Scala is an interesting hybrid of these two approaches — it can be written in a functional style, or in a more traditional Java-like style with mutable state.

Q: Why is functional programming, and hence Scala, important for Big Data?

As background, object-oriented programming is useful for projects that involve creating increasingly elaborate objects from simpler primitives. Similarly, functional programming is well-suited for applications that compose increasingly elaborate functions from simpler functional primitives. This is often the case in data science, explaining the growing interest in functional programming approaches. As for Big Data, the term implies a set of problems that are too large to handle with conventional approaches — which generally entails a certain amount of parallelism.

However, parallel processing is plagued by changes in state: if two parallel processes are attempting to change the same data, the result might be delayed (at best) or unpredictable (at worst). By reducing or eliminating mutability, functional approaches tend to lead to programs that naturally and simply handle concurrency and scalability.

Q: What are some of the important use cases for Scala?

Scala gained a lot of publicity in 2009, when Twitter announced it would be adopting the language for much of its backend. Since then, a number of other large enterprises have followed suit. But perhaps the biggest development in Scala has been Apache Spark, the big data processing framework. As somewhat of a successor to Apache Hadoop (whose MapReduce model was itself loosely based on functional processing), Spark is seeing enormous growth in adoption and interest — and the fact that it is written in Scala is drawing many to the language. Other notable projects written in Scala include the messaging system Kafka, and several mathematical and machine learning libraries (e.g. ScalaNLP and BIDMach).

Q: What does the success of Scala bode for the future of the programming landscape?

With its hybrid object-oriented / functional approach, Scala will serve as somewhat of a gateway drug, helping to gradually transform the landscape towards more functional approaches. While Scala is fully object-oriented, tools like Slick allow it to interface with relational databases to implement more of a functional-relational approach to data. The increasing interest in scalable functional programming thus dovetails with a resurgence of interest in scalable relational database technologies, such as SingleStore. We hope to see you at Big Data Scala! For conference details, click here: http://bigdatascala.bythebay.io/.
Read Post
Download the New and Improved SingleStore Ops
Engineering

Download the New and Improved SingleStore Ops

The latest release of SingleStore Ops – version 4.0.34 – is now available for download! In this release, we are offering SingleStore users new features to accelerate productivity. Download SingleStore Ops to get up and running on SingleStore Community Edition or SingleStore Enterprise Edition today. SingleStore Ops downloads and upgrades are available for free to all SingleStore Community and Enterprise users. Here are some of the features in the new SingleStore Ops release:

Ops Superusers

The new SingleStore Ops comes with an enhanced superuser account that locks down read and write access. Superusers can be created with a single command:

`memsql-ops superuser-add --password <password> <username>`

All users will log in through a screen that looks like this:
Read Post
How to Make a Believable Benchmark
Engineering

How to Make a Believable Benchmark

A benchmark asks a specific question, makes a guess about the expected result, and confirms or denies it with experiment. If it compares anything, it compares like to like and discloses enough details so that others can plausibly repeat it. If your benchmark does not do all of these things, it is not a benchmark. Today’s question comes from one of our engineers, who was talking to a customer about new features in SingleStoreDB Self-Managed 4. We added support for SSL network encryption between clients and the cluster, and also between nodes in the cluster. The customer wanted to know how performance would be impacted.
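Purely as an illustration of the kind of narrow question a believable benchmark asks, here is a minimal sketch that times the same single-row query over a plaintext connection and over TLS. The host, credentials, and CA path are placeholders, and a real test would also control for warm-up, concurrency, and result-set size:

```python
# Sketch: time N round trips over a plaintext vs. a TLS connection to ask
# "how much does client-to-cluster SSL cost?" Connection details are placeholders.
import time

import pymysql

N = 10_000  # round trips per connection type


def time_roundtrips(**extra):
    conn = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                           password="", **extra)
    with conn.cursor() as cur:
        start = time.time()
        for _ in range(N):
            cur.execute("SELECT 1")
            cur.fetchall()
        elapsed = time.time() - start
    conn.close()
    return elapsed


plain = time_roundtrips()
tls = time_roundtrips(ssl={"ca": "/path/to/ca.pem"})  # hypothetical CA certificate path
print("plaintext: %.0f queries/sec" % (N / plain))
print("tls:       %.0f queries/sec" % (N / tls))
```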
Read Post
How We Hire Remarkable Engineers
Engineering

How We Hire Remarkable Engineers

Read Post
Run SingleStore in Minutes with Docker
Engineering

Run SingleStore in Minutes with Docker

Evaluating software infrastructure is important, but it should not be difficult. You should be able to try and see quickly whether a piece of core software suits your needs. This is one of the many helpful use cases for Docker. Of course Docker has many more uses, including helping run a 107-node cluster with CoreOS, but this post focuses on the quick start scenario. With an install of boot2docker.io for Mac or Windows, and a pre-configured ‘cluster-in-a-box’ Docker container, you can be on your way to interacting with a distributed system like SingleStore in a few minutes. If you are ready to jump in, head to our Quick Start with Docker documentation.

In a nutshell, we have built a ‘quickstart’ container that comes installed with SingleStore Ops for management, a single-node SingleStore cluster, and some sample programs referenced in tutorials. Obviously, this is not the configuration to test drive maximum performance. If that were the case, you would want to take advantage of the distributed architecture of SingleStore across several nodes. But it is a great way to get a sense of working with SingleStore, connecting with a MySQL client, and experiencing the database first hand. If you already have Docker installed, you can jump right in with a few simple commands.

Spin up a cluster:

$ docker pull memsql/quickstart
$ docker run --rm --net=host memsql/quickstart check-system
$ docker run -d -p 3306:3306 -p 9000:9000 --name=memsql memsql/quickstart

At this point you can create a database and interact with the ‘cluster-in-a-box’ install of SingleStore. For example, you can run a quick benchmark against SingleStore with the following Docker command:

`$ docker run --rm -it --link=memsql:memsql memsql/quickstart simple-benchmark`

For more information on working with SingleStore and the ‘cluster-in-a-box’ Docker container, visit our documentation at docs.singlestore.com/latest/setup/docker. We’ve also made the Dockerfile available on GitHub at github.com/memsql/memsql-docker-quickstart. And if you would like to try SingleStore in full, please visit singlestore.com/free. There we have a free Community Edition with unlimited scale and capacity, and a free 30-day Enterprise Trial.
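Once the container is running, any MySQL-compatible client can connect on port 3306. A minimal sketch, assuming the quickstart image’s historical defaults of a root user with an empty password (check the image documentation if that has changed):

```python
# Sketch: connect to the 'cluster-in-a-box' container and create a database.
# Credentials assume the quickstart image's defaults (root, empty password).
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", password="")
with conn.cursor() as cur:
    cur.execute("CREATE DATABASE IF NOT EXISTS playground")
    cur.execute("SHOW DATABASES")
    for (name,) in cur.fetchall():
        print(name)
conn.close()
```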
Read Post
Finding and Solving Bottlenecks in Your System
Engineering

Finding and Solving Bottlenecks in Your System

Read data in, write data out. In their purest form, this is what computers accomplish. Building a high performance data processing system requires accounting for how much data must move, to where, and the computational tasks needed. The trick is to establish the size and heft of your data, and focus on its flow. Identifying and correcting bottlenecks in the flow will help you build a low latency system that scales over time.

Characterizing your system

Before taking action, characterize your system using the following eight factors:

- Working set size: the set of data a system needs to address during normal operation. A complex system will have many distinct working sets, but one or two of them usually dominate.
- Average transaction size: the working set of a single transaction performed by the system.
- Request rate: the expected throughput. The combination of throughput and transaction size governs most of the total data flow of the system.
- Update rate: a measure of how often data is added, deleted, and edited.
- Consistency: the time required for an update to spread through the system.
- Locality: the portion of a working set a request needs access to.
- Computation: the amount of math needed to run on the data.
- Latency: the expected time for transactions to return a success or failure.

Identifying bottlenecks

After pinpointing these characteristics, it should be possible to determine the dominant operation responsible for data congestion. Your answer might be obvious, but identifying the true bottleneck will provide a core factor to focus on.

The pizzeria example

Let’s say you own a pizza shop and want to make more money. If there are long lines to order, you can double the number of registers. If the pizzas arrive late, you can work on developing a better rhythm. You might even try raising the oven temperature a bit. But fundamentally, a pizza shop’s bottleneck is the size of its oven. Even if you get everything else right, you won’t be able to move more pizzas per day without expanding your oven’s capacity or buying a second one.

If you can’t clearly see a fundamental bottleneck, change a constraint and see what shifts in response. What would happen if you had to reduce the latency requirement by 10x? If you halved the number of computers? What tricks could you get away with if you relaxed the constraint on consistency? It’s common to take the initial constraints as true and unmoving, but they rarely are. Creativity in the questions has more leverage than creativity in the answers.

If you’re looking to build a well-designed computing system, I contributed an in-depth article on InfoQ that provides use cases and real-world examples.
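To make those factors concrete, here is a back-of-envelope sketch with invented numbers; the point is the arithmetic (request rate × transaction size ≈ total data flow), not the specific values:

```python
# Back-of-envelope sketch with invented numbers for a few of the factors above.
request_rate = 50_000           # requests per second (expected throughput)
avg_transaction_bytes = 2_000   # data touched per request
working_set_gb = 200            # hot data the system must keep close at hand

data_flow_mb_per_s = request_rate * avg_transaction_bytes / 1e6
print(f"aggregate data flow: ~{data_flow_mb_per_s:.0f} MB/s")
print(f"working set to keep in memory (or fast storage): ~{working_set_gb} GB")
```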
Read Post
Boost Conversions with Overlap Ad Targeting
Engineering

Boost Conversions with Overlap Ad Targeting

Digital advertising is a numbers game played out over billions of interactions. Advertisers and publishers build predictive models for buying and selling traffic, then apply those models over and over again. Even small changes to a model, changes that alter conversion rates by fractions of a percent, can have a profound impact on revenue over the course of a billion transactions. Serving targeted ads requires a database of users segmented by interests and demographic information. Granular segmentation allows for more effective targeting. For example, you can choose more relevant ads if you have a list of users who like rock and roll, jazz, and classical music than if you just have a generic list of music fans. Knowing the overlap between multiple user segments opens up new opportunities for targeting. For example, knowing that a user is both a fan of classical music and lives in the San Francisco Bay Area allows you to display an ad for tickets to the San Francisco Symphony. This ad will not be relevant to the vast majority of your audience, but may convert at a high rate for this particular “composite” segment. Similarly, you can offer LA Philharmonic tickets to classical fans in Southern California, Outside Lands tickets to rock and roll fans in the Bay Area, and so on.
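As a toy illustration of a composite segment, treat each segment as a set of user IDs and target the intersection (all names and IDs below are invented):

```python
# Toy sketch: a composite segment is the intersection of two audience segments.
classical_fans = {101, 102, 103, 104}   # invented user IDs tagged "classical music"
bay_area_users = {103, 104, 105, 106}   # invented user IDs tagged "SF Bay Area"

sf_symphony_audience = classical_fans & bay_area_users
print(sf_symphony_audience)  # {103, 104} -> show these users the SF Symphony ad
```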
Read Post
Turn Up the Volume With High-Speed Counters
Engineering

Turn Up the Volume With High-Speed Counters

Scaling tends to make even simple things, like counting, seem difficult. In the past, businesses used specialized databases for particular tasks, including high-speed, high-throughput event counters. Due to the constraints of legacy systems, some people still assume that relational databases cannot handle high-throughput tasks at scale. However, due to advances like in-memory storage, high-throughput counting no longer requires a specialized, single-purpose database.

Why do we even need counters?

Before we get into the implementation, you might be asking why we need counters at all. Why not just collect event logs and compute counts as needed? In short, querying a counter is much faster than counting log records, and many applications require instant access to this kind of data. Counting logs requires a large table scan and aggregation to produce a count. With an updatable counter, it is a single record lookup. The challenge with high-throughput counters is that building a stateful, fault-tolerant distributed system is hard. Fortunately, SingleStore solves those hard problems for you, so you can focus on building your application. In the rest of this article we’ll design a simple, robust counter database running on a modest SingleStore cluster, and benchmark how it performs.

Counters are records

Let’s start by creating the following schema:

create database test;
use test;

create table counters_60 (
  time_bucket int unsigned not null,
  event_type int unsigned not null,
  counter int unsigned not null,
  primary key (time_bucket, event_type)
);

create table event_types (
  event_type int unsigned not null primary key,
  event_name varchar(128),
  owner varchar(64),
  status enum ('active', 'inactive')
);

The column time_bucket is the timestamp on the event rounded to the nearest minute. Making time_bucket and event_type the primary key allows us to easily index events by time and type.

insert into counters_60
select unix_timestamp() / 60, 1234, 1
on duplicate key update counter = counter + 1;

If a primary key value does not exist, this query will insert a new record into SingleStore. If the primary key value exists, the counter will be incremented. This is informally called an “upsert.” The management of event_types is outside the scope of this article, but it’s trivial (and fast) to join the counter table to a table containing event metadata such as its human-friendly name. Let’s also insert some data into the event_types table:

insert into event_types values (1234, 'party', 'memsql', 'active');

Querying Counters

Now you have the counts of each event type bucketed by minute. This counter data can easily be aggregated and summarized with simple SQL queries:

-- all-time historical counts of various event types
select e.event_type, e.event_name, sum(c.counter)
from counters_60 c, event_types e
where c.event_type = e.event_type and e.event_type in (1234, 4567, 7890)
group by 1, 2;

-- total number of events in the last hour
select sum(counter), sum(counter)/60 as 'avg per min'
from counters_60
where event_type = 1234 and time_bucket >= unix_timestamp() / 60 - 60;

-- total number of events in time series, bucketed in 10-minute intervals
select floor((unix_timestamp()/60 - time_bucket)/10) as `interval`, sum(counter)
from counters_60
where event_type = 1234 and time_bucket >= unix_timestamp() / 60 - 60
group by 1;

1.6 Million increments per second

Inserting naively into the counters table, one record at a time, actually gets you pretty far. In our testing this resulted in a throughput of 200,000 increments per second. It’s nice to get impressive performance by default. Then we tried to see how much farther we could go. In this simulation we processed 1,000 different event types. We created a threaded Python script to push as many increments per second as possible. We made three changes to the naive version: multi-insert batches, disabling cluster-wide transactions, and sorting the records in each batch to avoid deadlocking.

insert into counters_60 values
  (23768675, 1234, 1),
  (23768675, 4567, 1),
  (23768675, 7890, 1),
  ...
on duplicate key update counter = counter + 1;

We used a 6-node AWS cluster with 2 aggregators and 4 leaves to simulate the workload. Each node was an m3.2xlarge with 8 cores and 15GB of RAM, for an hourly cost of $2.61 for the entire cluster. When running this script on both aggregator nodes, we achieved a throughput of 1.6M upserts per second.

Data Collection

In this simulation we use a Python script to simulate the data ingest. In the real world, we see our customers use technologies like Storm, Kafka, and Spark Streaming to collect events in a distributed system for higher throughput. For more information on SingleStore integration with stream processing engines, see this blog post on how Pinterest uses SingleStore and Spark Streaming to track real-time event data. Want to build your own high-throughput counter? Download SingleStore today!
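The post doesn’t include the load generator itself, but the batching trick is easy to sketch: build multi-row upsert batches, sort each batch by primary key so concurrent batches lock rows in the same order, and drive them from several threads (disabling cluster-wide transactions is a server-side setting, not shown). A rough, hypothetical version with placeholder connection details:

```python
# Rough sketch of a batched, deadlock-avoiding upsert load generator.
import random
import threading
import time

import pymysql

EVENT_TYPES = 1000   # matches the simulation described above
BATCH_SIZE = 500     # placeholder batch size


def upsert_batches(n_batches):
    conn = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                           password="", database="test", autocommit=True)
    with conn.cursor() as cur:
        for _ in range(n_batches):
            bucket = int(time.time() // 60)
            # Dedupe within the batch and sort by primary key so that concurrent
            # batches touch rows in the same order (avoiding deadlocks).
            keys = {(bucket, random.randrange(EVENT_TYPES)) for _ in range(BATCH_SIZE)}
            rows = sorted(keys)
            placeholders = ", ".join(["(%s, %s, 1)"] * len(rows))
            args = [v for row in rows for v in row]
            cur.execute(
                "INSERT INTO counters_60 VALUES " + placeholders +
                " ON DUPLICATE KEY UPDATE counter = counter + 1",
                args)
    conn.close()


threads = [threading.Thread(target=upsert_batches, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```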
Read Post
Geospatial Intelligence Coming to SingleStore
Engineering

Geospatial Intelligence Coming to SingleStore

This week at the Esri Developers Summit in Palm Springs, our friends at Esri are previewing upcoming features for the next release of SingleStore, using a huge real-world geospatial dataset. Esri develops geographic information systems (GIS) that function as an integral component in nearly every type of organization. In a recent report by the ARC Advisory Group, the Geographic Information System Global Market Research Study, the authors stated, “Esri is, without a doubt, the dominant player in the GIS market.”

Everything happens somewhere. But, traditionally, spatial data has been locked away in specialized software that either lacked general database features or didn’t scale out. With SingleStore we are making geospatial data a first-class citizen: just as easy to use, at scale, at great speed and high throughput, as any other kind of data. The demonstration uses the “Taxistats” dataset: a compilation of 170 million real-world NYC taxi rides. It includes GPS coordinates of the pickup and dropoff, distance, and travel time. SingleStore is coupled with the new version of Esri’s ArcGIS Server, which has a new feature to translate ArcGIS queries into external database queries. From there we generate heatmaps from the raw data in sub-second time.

Heatmaps are a great way to visualize aggregate geospatial data. The X and Y are the longitude and latitude of “cells” or “pixels” on the map, and the color shows the intensity of the values. From there you can explore the dataset across any number of dimensions: zoom in on an area, filter by time, length of ride, and more.
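Conceptually, the heatmap is just a group-by over a grid: snap each pickup’s longitude and latitude to a cell and count. A toy sketch with a few invented points (the real demo pushes this aggregation down into the database):

```python
# Toy sketch: snap (longitude, latitude) points to grid cells and count per cell --
# the aggregation behind a heatmap. The pickup points below are invented.
from collections import Counter

CELL_DEG = 0.005  # cell size in degrees; smaller cells mean a finer heatmap

pickups = [(-73.9857, 40.7484), (-73.9851, 40.7480), (-74.0060, 40.7128)]

cells = Counter((round(lon / CELL_DEG), round(lat / CELL_DEG)) for lon, lat in pickups)
for (cx, cy), count in cells.most_common():
    print(f"cell ({cx}, {cy}): {count} pickups")
```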
Read Post
Load Files from Amazon S3 and HDFS with the SingleStore Loader
Engineering

Load Files from Amazon S3 and HDFS with the SingleStore Loader

One of the most common tasks with any database is loading large amounts of data into it from an external data store. Both SingleStore and MySQL provide the LOAD DATA command for this task; this command is very powerful, but by itself, it has a number of restrictions:

- It can only read from the local filesystem, so loading data from a remote store like Amazon S3 requires first downloading the files you need.
- Since it can only read from a single file at a time, loading from multiple files requires multiple LOAD DATA commands. If you want to perform this work in parallel, you have to write your own scripts.
- If you are loading multiple files, it’s up to you to make sure that you’ve deduplicated the files and their contents.

Why We Built the SingleStore Loader

At SingleStore, we’ve acutely felt all of these limitations. That’s why we developed SingleStore Loader, which solves all of the above problems and more. SingleStore Loader lets you load files from Amazon S3, the Hadoop Distributed File System (HDFS), and the local filesystem. You can specify all of the files you want to load with one command, and SingleStore Loader will take care of deduplicating files, parallelizing the workload, retrying files if they fail to load, and more.

Use a load command to load a set of files
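As an illustration of the do-it-yourself scripting the restrictions above describe (not the Loader itself), here is a minimal sketch that fans out plain LOAD DATA LOCAL INFILE statements over already-downloaded files; the paths, table, and CSV format are invented, and you would still have to handle deduplication and retries yourself:

```python
# Sketch of the DIY approach the Loader replaces: one LOAD DATA per downloaded
# file, parallelized by hand. Paths, table name, and format are placeholders.
from concurrent.futures import ThreadPoolExecutor

import pymysql

FILES = ["/data/events-000.csv", "/data/events-001.csv", "/data/events-002.csv"]


def load_one(path):
    conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", password="",
                           database="test", local_infile=True, autocommit=True)
    with conn.cursor() as cur:
        cur.execute(
            "LOAD DATA LOCAL INFILE '{}' INTO TABLE events "
            "FIELDS TERMINATED BY ','".format(path))
    conn.close()
    return path


with ThreadPoolExecutor(max_workers=4) as pool:
    for done in pool.map(load_one, FILES):
        print("loaded", done)
```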
Read Post
Cache is the new RAM
Engineering

Cache is the new RAM

Read Post