Recent Articles

Trending
SingleStore Meetups - Year in Review
It has been six months since we began hosting meetups regularly at SingleStore. Our office is located in the heart of SoMa, two blocks from the Caltrain station. At the new San Francisco epicenter of tech startups, we want to meet our neighbors and see what other cool technologies are out there! What better way than over cold brews, local pizza, and deep tech talks?
In honor of the first official meetup of 2016, we decided to take a look back at the meetups of 2015, and share highlights from each one. Hope to see you at 534 4th St on January 21st for an intimate conversation with Chandan Joarder on Building Real-Time Digital Insight at Macys.com!
RSVP for our next meetup: Building Real-Time Digital Insight at Macys.com
Without further ado, we present Meetups: A Year in Review.
Read Post

Data Intensity
Choosing the Right Infrastructure for IoT
The infrastructure of IoT will have a real-time database behind every sensor.
Soon every device with a sensor will blend seamlessly into the Internet of Things, from drones to vehicles to wearables. Device and sensor count predictions range from billions to trillions. With this tidal wave of new devices comes an increasing number of new data streams, converging to make instant analytics on real-time data a tenet of any digital transformation.
Our penchant for instant gratification extends to every time we press a button or ask a question online. Today, data must move at the speed of thought and real-time information brings us as close as possible to the present.
The infrastructure to make this interaction possible ranges from the edge of the network into the core of the data center, and must include a database to support new interactive applications and analytics. Let’s examine a few compelling IoT use cases where turning data into actionable insights is table stakes.
Drones – Managing the Machines
Read Post

Engineering
Investigating Linux Performance with Off-CPU Flame Graphs
The Setup
As a performance engineer at SingleStore, one of my primary responsibilities is to ensure that customer Proof of Concepts (POCs) run smoothly. I was recently asked to assist with a big POC, where I was surprised to encounter an uncommon Linux performance issue. I was running a synthetic workload of 16 threads (one for each CPU core). Each one simultaneously executed a very simple query (`select count(*) from t where i > 5`) against a columnstore table.
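To give a sense of the workload shape, a run like this can be approximated with a short client script. The sketch below is a hypothetical reproduction, not the actual test harness; the connection details and iteration counts are placeholders.

```python
# Hypothetical reproduction of the synthetic workload: 16 client threads, each
# repeatedly running the same simple aggregate over a columnstore table.
# SingleStore speaks the MySQL protocol, so a standard driver works.
import threading
import pymysql

QUERY = "select count(*) from t where i > 5"
NUM_THREADS = 16      # one per CPU core in the original test
ITERATIONS = 100      # queries per thread (placeholder)

def worker():
    conn = pymysql.connect(host="127.0.0.1", port=3306,
                           user="root", password="", database="test")
    with conn.cursor() as cur:
        for _ in range(ITERATIONS):
            cur.execute(QUERY)
            cur.fetchone()
    conn.close()

threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```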
In theory, this ought to be a CPU bound operation since it would be reading from a file that was already in disk buffer cache. In practice, our cores were spending about 50% of their time idle:
Read Post

Engineering
The Lambda Architecture Isn’t
The surest sign you have invented something worthwhile is when several other people invent it too. That means the creative pressure that gave birth to the idea is more general than your particular situation.
Read Post

Case Studies
Rethinking Lambda Architecture for Real-Time Analytics
Big data, as a concept and practice, has been around for quite some time now. Most companies have responded to the influx of data by adapting their data management strategy. However, managing data in real time still poses a challenge for many enterprises. Some have successfully incorporated streaming or processing tools that provide instant access to real-time data, but most traditional enterprises are still exploring options. Complicating the matter further, most enterprises need access to both historical and real-time data, which require distinct considerations and solutions.
Of the many approaches to managing real-time and historical data concurrently, the Lambda Architecture is by far the most talked about today. Like the shape of the Greek letter it is named for, the Lambda Architecture forks into two paths: one is a streaming (real-time) path, the other a batch path. Thus, it accommodates real-time high-speed data service along with an immutable data lake. Oftentimes a serving layer sits on top of the streaming path to power applications or dashboards.
A Fork in the Road
Many Internet-scale companies, like Pinterest, Zynga, Akamai, and Comcast have chosen SingleStore to deliver the high-speed data component of the Lambda architecture. Some customers have chosen to fork the input stream in order to push data into SingleStore and a data lake, like HDFS, in parallel.
Here is an example of the Comcast Lambda Architecture:
Read Post

Data Intensity
Predictions 2016: the Impact of Real-Time Data
Prediction 1. The industrial internet moves to real-time data pipelines
The industrial internet knits together big data, machine learning, and machine-to-machine communications to detect patterns and adjust operations in near real time. Soon the definition of the industrial internet will expand to include the Internet of Things.
The detection of patterns and insights often comes with a price: time. While the goal of machine learning is to develop models that will prove useful, dealing with large data sets means it can take days, weeks or months to reach meaningful discoveries.
We predict that in the very near future, real-time data streams will transform what is possible across the industrial internet, so users can ask critical questions, adjust a process, or see a pattern in the moment. Entire industries such as energy, pharmaceutical and even agriculture will be dramatically impacted by the ability to analyze real-time and historical data together to make business decisions faster.
Prediction 2. Consumer visibility into business gets granular
The world today moves at a different pace than a generation ago. Applications on handheld devices that move us through our day tell us where to eat, how to get from point A to point B, what is the fastest route, everything that is happening in the world, and even what our friends are buying. Data is driving the course of business – and dramatically impacting the consumer experience.
We predict that in a few short years, consumer visibility into business operations will get more granular. For example, look at the transparency that already exists with companies such as Blue Apron and FedEx. Not only do we know exactly what is on the menu week to week at Blue Apron, we can opt out if it is something we do not like, or adjust the delivery times. And FedEx allows consumers to track the entire journey of a package and sometimes even reroute a package to a new delivery destination. More and more companies will adopt transparency for consumers, and in doing so, will build brand loyalty and satiate growing consumer appetite for on-demand services.
Prediction 3. The cost of doing business declines
Just a few years ago the cost of storage was a board room conversation, where CIOs had to justify the rising cost of IT associated with growing data volumes. For many CIOs, storage was an IT line item that was on track to outpace profitability.
Today, the conversation around data storage has changed. Storage is cheap and highly accessible—any business unit within an organization can tap into the cloud. Access to commodity hardware makes rapidly scaling a business possible.
The cost of doing business will further decline as in-memory technologies set computing on a new course. While companies like Amazon provide access to more than a terabyte of memory for just a few dollars an hour via public or private clouds, other companies have created technology that provides relatively low-cost access to terabytes of non-volatile memory, which developers can use instead of storage. In-memory databases use vast stores of memory close to the compute to rapidly process data. Access to more memory means that programmers will be able to write different types of software—propelling the industry toward what is perhaps a new era of applications built on commodity hardware.
We are already seeing verticalized trends for data analytics and applications that can serve up real-time value across healthcare, manufacturing, and retail. What’s next?
Prediction 4. The crowdsourcing of analytics
The world of artificial intelligence (AI) used to lie solidly in the hands of physicists, scientists, and researchers, well beyond the reach of the general population. Today, AI has shifted and is empowering people all over the world to participate in the analytics process. Crowdflower, for example – now Appen – blends a human-in-the-loop process alongside data science and machine learning to generate insights. Kaggle, another company crowdsourcing for analytics, has built one of the world’s largest communities of data scientists to solve complex challenges through a competitive approach to data science.
Data analysis will be more pervasive, and new applications that empower the data collection process will be broadly embraced. Consider the power of Waze and INRIX, both in use today to crowdsource traffic congestion. While the requirement is the participation of a social community at large, the upside potential is felt much more broadly. The same data collection process could be applied to many more applications to affect and improve society.
Read Post

Engineering
Introducing a Performance Boost for Spark SQL, Plus Python Support
This month’s SingleStore Ops release includes performance features for Streamliner, our integrated Apache Spark solution that simplifies creation of real-time data pipelines. Specific features in this release include the ability to run Spark SQL inside of the SingleStore database, in-browser Python programming, and NUMA-aware deployments for SingleStore.
We sat down with Carl Sverre, SingleStore architect and technical lead for Ops development, to talk about the latest release.
Q: What’s the coolest thing about this release for users?
I think the coolest thing for users is that we now support Python as a programming language for building real-time data pipelines with Streamliner. Previously, users needed to code in Scala – Scala is less popular, more constrained, and harder to use. In contrast, Python syntax is widely in use by developers, and has a broad set of programming libraries providing extensibility beyond Spark. Users can import Python libraries like Numpy, Scipy, and Pandas, which are easy to use and feature-rich compared to corresponding Java / Scala libraries. Python also enables users to prototype a data pipeline much faster than with Scala. To allow users to code in Python, we built SingleStore infrastructure on top of PySpark and also implemented a ‘pip’ command that installs any Python package across machines in a SingleStore cluster.
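As a rough illustration of what writing pipeline logic in Python buys you, here is the kind of per-batch enrichment step one might express with PySpark. This is a generic sketch with made-up column names, not the actual Streamliner interface.

```python
# Illustrative only: a per-batch transformation of the sort a Python pipeline
# might apply before persisting to SingleStore. Schema and names are made up.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("example-transform").getOrCreate()

# Pretend this DataFrame arrived from a streaming source such as Kafka.
raw = spark.createDataFrame(
    [("2016-01-21 10:00:00", "click", 3.2),
     ("2016-01-21 10:00:01", "view", 0.0)],
    ["event_time", "event_type", "value"])

# Enrich: cast the timestamp and flag high-value events.
enriched = (raw
            .withColumn("event_time", F.col("event_time").cast("timestamp"))
            .withColumn("is_high_value", F.col("value") > 1.0))

enriched.show()
```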
Read Post

Data Intensity
Characteristics of a Modern Database
Many legacy database systems are not equipped for modern applications. Near-ubiquitous connectivity drives high-velocity, high-volume data workloads – think smartphones, connected devices, sensors – and a unique set of data management requirements. As the number of connected applications grows, businesses turn to in-memory solutions built to ingest and serve data simultaneously.
Bonus Material: Free O’Reilly Ebook – learn how to build real-time data pipelines with modern database architectures
To support such workloads successfully, database systems must have the following characteristics:
Modern Database Characteristics
Ingest and Process Data in Real Time
Historically, the lag time between ingesting data and understanding that data has been hours to days. Now, companies require data access and exploration in real time to meet consumer expectations.
Subsecond Response Times
As organizations supply access to fresh data, demand for access rises from hundreds to thousands of analysts. Serving this workload requires memory-optimized systems that process transactions and analytics concurrently.
Anomaly Detection as Events Occur
Reaction time to an irregular event often correlates with a business’s financial health. The ability to detect an anomaly as it happens helps companies avoid massive losses and capitalize on opportunities.
Generate Reports Over Changing Datasets
Today, companies expect analytics to run on changing datasets, where results are accurate to the last transaction. This real-time query capability has become a base requirement for modern workloads.
Real-Time Use Cases
Today, companies are using in-memory solutions to meet these requirements. Here are a few examples:
Pinterest: Real-Time Analytics
Pinterest built a real-time data pipeline to ingest data into SingleStore using Spark Streaming. In this workflow, every Repin is filtered and enriched by adding geolocation and Repin category information. Enriched data is persisted to SingleStore and made available for query serving. This helps Pinterest build a better recommendation engine for showing Repins and enables their analysts to use a familiar SQL interface to explore real-time data and derive insights.
Read Post

Data Intensity
Using Oracle and SingleStore Together
We often hear “How can I use SingleStore together with my Oracle database?”
As a relational database, SingleStore is similar to an Oracle database, and can serve as an alternative to Oracle in certain scenarios. Here is what sets SingleStore apart:
SingleStore is a distributed system, designed to run on multiple machines with a massively parallel processing architecture. An Oracle database, on the other hand, resides on a single, large machine, or in a smaller, fixed-size cluster.
SingleStore has two primary data stores: an in-memory rowstore and a disk-based columnstore. An Oracle database, on the other hand, has one primary data store – a disk-based rowstore. Oracle does have an in-memory option that allows users to make a columnar copy of its disk-based rowstore data in memory, but even with that, all in-memory data must first be created on disk.
With its distinct architecture, SingleStore complements Oracle in several cases where users can deploy the two databases side by side. These include:
SingleStore as the real-time analytics engine for an Oracle database
SingleStore as the ingest layer for an Oracle database
SingleStore as the stream processing layer for an Oracle database
SingleStore as the Real-Time Analytics Engine for an Oracle Database
Enterprises typically have Oracle databases in place for transactional (OLTP) workloads. In these cases, batch processes (ETL) are typically run at the end of each day to transfer data into a separate Oracle data warehouse for analytical (OLAP) workloads. Within the Oracle data warehouse, data is then aggregated and rolled up for efficient querying.
SingleStore performance eliminates the need for batch processing. Data can be copied from the OLTP Oracle database into SingleStore immediately through Oracle GoldenGate or another change data capture tool, and analytical queries can be performed in real-time.
By eliminating ETL, SingleStore minimizes the time between data coming into the system and analysis being gathered from that data set. Ultimately, this enhances enterprises’ ability to make decisions in real time.
SingleStore as the Ingest Layer for an Oracle Database
For many enterprises using Oracle databases, the rate at which data is inserted into the database can be too large for an affordable Oracle system to handle. Ingest performance for an Oracle database is limited by its disk; all inserts into Oracle need to be persisted to disk at each database commit. The Oracle in-memory option does not help with inserts, as “in-memory” data is just a copy of the disk-resident data. As such, any insert into an Oracle database is limited by disk speeds.
For optimal ingest that takes full advantage of in-memory processing, you need a pure in-memory database like SingleStore. Using SingleStore as the data ingest layer on top of an Oracle database allows ingest at in-memory speed.
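Because SingleStore speaks the MySQL wire protocol, the ingest layer can be fed with any standard MySQL driver. The sketch below is a minimal illustration under that assumption; the table, schema, and connection details are placeholders. In production, change data capture or a streaming pipeline would replace the hand-rolled loop; the point is only that the ingest path is ordinary SQL.

```python
# Hypothetical bulk-insert sketch against SingleStore over the MySQL protocol.
import pymysql

rows = [(i, "sensor-%d" % (i % 100), i * 0.1) for i in range(10000)]

conn = pymysql.connect(host="127.0.0.1", port=3306,
                       user="root", password="", database="ingest_demo")
with conn.cursor() as cur:
    cur.execute("""CREATE TABLE IF NOT EXISTS readings (
                       id BIGINT PRIMARY KEY,
                       device VARCHAR(64),
                       value DOUBLE)""")
    # executemany batches the inserts, amortizing client round trips.
    cur.executemany("INSERT INTO readings VALUES (%s, %s, %s)", rows)
conn.commit()
conn.close()
```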
SingleStore as the Stream Processing Layer for an Oracle Database
Streaming data has become quite popular, yet the Oracle database was designed long ago, well before data streams from sources like Apache Kafka came about. These streams often have unstructured and high volume data that requires real-time transformation. Processing a data stream with these traits requires a specially designed system. To that end, SingleStore provides an integrated Apache Spark solution called Streamliner. Streamliner makes it easy to deploy Spark within SingleStore for ingesting and enriching data streams. With Streamliner, SingleStore can serve as the stream processing layer in front of an Oracle database.
The Avant-Garde of New Relational Databases
Alex Woodie cites a recent Gartner research report from Adam Ronthal in Meet the Avant-Garde of New Relational Databases; the Gartner report states that “over the next three years, 70 percent of new projects requiring ‘scale-out elasticity, distributed processing and hybrid cloud capabilities for relational applications, as well as multi-data-center transactional consistency’, will prefer an ‘emerging’ database vendor over traditional vendors.”
Enterprises with existing Oracle databases should consider adding an ‘emerging’ database like SingleStore into the mix for benefits of scale out, distributed processing, and memory-first technology.
Learn more about SingleStore at http://www.singlestore.com, and download our Community Edition or free Enterprise Trial to get started today!
Read Post

Trending
Everyone Deserves Nice Things
Software is eating the world! It’s a data explosion! The Internet is now of Things! Scratch that, even better – it is of Everything! Did Big Data just call out IoT on Twitter? Click here to find out. [1]
I kid the Internet. In all seriousness, what a magical time we live in. Moore’s Law means cheap hardware, then next thing you know, Cloud. Internet ubiquity invites the globe to the party. Feats of software engineering that were impossible for nation-states to pull off a decade ago are de rigueur for clever teens. On their mobile phones.
I became disillusioned after a few years as a sales engineer for big ticket software products. I talked to so many operations people who spent all their time putting out fires instead of automating and improving. Even worse, it seemed nearly all the actual users were sad about their applications. These amazing modern conveniences were supposed to make our lives easier, not more difficult. The actual outcome was the exact opposite of the expected outcome. Oh irony!
These thoughts and feelings led me to join Atlassian in 2008. If you have never heard of that company, I reckon you have at least heard of JIRA, Confluence or HipChat. Here was a group making software that people were using voluntarily. Even tiny teams could implement it without breaking the bank, or gratis if they were building open source. Furthermore, the company was totally focused on software development. Agile was rising to prominence. Git went from non-existence to dominance in an eye blink. Software development was undergoing a sea change in the right direction.
This is what brings me to SingleStore. Companies like Google, Facebook, eBay, and Netflix had to feed their databases and infrastructure all the steroids to meet the challenge of true web scale. They, among others, pioneered new ways to ingest and work with previously unimaginable mountains of data. Did you use metric prefixes above giga- in your daily life 10 years ago? Nor did I. Yottabytes of records, anyone?
Being able to handle massive data sets and use them to make real-time decisions that delight customers is the new nice thing I believe everyone deserves. That is why I am elated to join SingleStore, focusing on the Community Edition. Imagine what you could build better, stronger and faster if the database underneath was ready to handle anything thrown at it.
If you are already using SingleStore Community Edition, I am very keen to hear what you’re doing with it. If you have a moment, please take this very short survey. Don’t hesitate to hit me up on our public Slack, email or elsewise.
And away we go…
[1] – Citations
Why Software Is Eating The World
Gartner Says Business Intelligence and Analytics Need to Scale Up to Support Explosive Growth in Data Sources
The Internet of Things is revolutionising our lives, but standards are a must
The Next Big Thing for Tech: The Internet of Everything
Read Post
Data Intensity
Building Real-Time Data Pipelines through In-Memory Architectures [Webcast]
In the era of universal connectivity, the faster you can move data from point A to B the better. Equipping your organization with the ability to make frequent decisions in an instant offers information and intelligence advantages, such as staying one step ahead of the competition. This is especially important when incoming data is arriving at a relentless pace, in high volume, and from a variety of devices.
As our customers tap into new sources of data or modify existing data pipelines, we are often asked questions like: What technologies should we consider? Where can we reduce data latency? How can we simplify data architectures?
To eliminate the guesswork, we teamed up with Ben Lorica, Chief Data Scientist at O’Reilly Media to host a webcast centered around building real-time data pipelines.
Watch the recorded webcast to learn:
Ideal technology stacks for building real-time data pipelines
How to simplify Lambda architectures
How to use memory-optimized technologies like Kafka, Spark, and in-memory databases to build real-time data pipelines
Use cases for real-time workloads, and the value they offer
Examples of data architectures used by companies like Pinterest and Comcast
Webcast Recording
Read Post

Data Intensity
Market Making with SingleStore: Simulating Billions of Stock Trades in Real Time
I woke up around 7:30 AM on Monday August 24th, checked my phone while lying in bed, and saw that I had lost thousands of dollars in my sleep. Great way to start the week…
I was not alone – few investors escaped “Black Monday” unscathed. The past several months have been full of sudden surges and declines in stock prices, and extreme volatility is apparently the “new normal” for global financial markets. Frequent and dramatic market swings put a high premium on access to real-time data. For securities traders, data processing speed and analytic latency can be the difference between getting burned and getting rich.
The challenge for trading and financial services companies is not only collecting real-time market data, but also making decisions based on that data in real time. Legacy SQL database technology lacked the speed, scale, and concurrency to ingest market data and execute analytical queries simultaneously. This forced companies into buying and/or building complex and specialized systems for analytics. However, as commercially-available database technology has caught up with the speed of financial markets, it is possible to use SQL to build sophisticated applications and analyze real-time financial data.
Simple Quote Generator
While we tend to talk about stocks as having a single, definitive price per share, the reality of securities trading is more complex. Buyers and sellers see different prices, known as the bid and ask prices, respectively. Public bid and ask prices are actually the best available prices chosen among all bids placed by buyers and all asks offered by sellers.
Beyond getting the best available deal, trading firms have other incentives for tracking best bid and ask prices. For one, the Securities and Exchange Commission requires that brokers always give their customers the best available price, known as the National Best Bid and Offer (NBBO).
The script gen.py simulates a distribution of bid and ask quotes for fluctuating stock prices. The model for the generator is very simple and easily modified. The script treats each stock as a random walk. Every iteration begins at a base value, uses a probability distribution function to generate a spread of bids and asks, then randomly fluctuates the stock value. The entire generator is around 100 lines of Python.
Python interpretation is the bottleneck when running on even a small SingleStore cluster. Even so, it achieves high enough throughput for a realistic simulation. As performance bottlenecks go, “the database is faster than the data generator” isn’t such a bad problem.
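The actual gen.py is not reproduced here, but a minimal random-walk quote generator along the lines described above might look like the following sketch. The symbols, prices, and distribution parameters are made up; a real run would insert each quote into SingleStore instead of printing it.

```python
# Minimal random-walk bid/ask generator in the spirit of the gen.py described
# above. Symbols, base prices, and distribution parameters are illustrative.
import random

SYMBOLS = {"AAPL": 110.0, "GOOG": 640.0, "MSFT": 45.0}
QUOTES_PER_TICK = 10

def generate_tick(symbol, base_price):
    """Yield a spread of (symbol, side, price) quotes around the base price."""
    for _ in range(QUOTES_PER_TICK):
        spread = abs(random.gauss(0, 0.05))        # half-spread in dollars
        yield (symbol, "bid", round(base_price - spread, 2))
        yield (symbol, "ask", round(base_price + spread, 2))

def step(base_price):
    """Random-walk the underlying price by a small relative move."""
    return base_price * (1 + random.gauss(0, 0.001))

if __name__ == "__main__":
    for _ in range(5):                              # five iterations of the walk
        for sym in SYMBOLS:
            for quote in generate_tick(sym, SYMBOLS[sym]):
                print(quote)                        # gen.py would INSERT these
            SYMBOLS[sym] = step(SYMBOLS[sym])
```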
Read Post

Engineering
Technical Deep Dive into SingleStore Streamliner
SingleStore Streamliner, an open source tool available on GitHub, is an integrated solution for building real-time data pipelines using Apache Spark. With Streamliner, you can stream data from real-time data sources (e.g. Apache Kafka), perform data transformations within Apache Spark, and ultimately load data into SingleStore for persistence and application serving.
Streamliner is a great tool for developers and data scientists since little to no code is required – users can instantly build their pipelines.
For instance, a non-trivial yet still no-code-required use case is: pulling data in a comma-separated value (CSV) format from a real-time data source; parsing it; then creating and populating a SingleStore table. You can do all this within the Ops web UI, depicted in the image below.
As you can see, we have simulated the real-time data source with a “Test” that feeds in static CSV values. You can easily replace that with Kafka or a custom data source. The static data is then loaded into the hr.employees table in SingleStore.
Read Post

Trending
Find Your IoT Use Case
As enterprises invest billions of dollars in solutions for the Internet of Things (IoT), business leaders seek compelling IoT use cases that tap into new sources of revenue and maximize operational efficiency. At the same time, advancements in data architectures and in-memory computing continue to fuel the IoT fire, enabling organizations to affordably operate at the speed and scale of IoT.
In a recent webcast, Matt Aslett, Research Director at 451 Research, shared use cases across six markets where IoT will have a clear impact:
Watch the IoT and Multi-model Data Infrastructure Webcast Recording
Industrial Automation
Seen as the ‘roots of IoT’, the industrial automation sector is improving performance and reducing downtime by adding automation through sensors and making data available online.
Utilities
When people think about IoT, household utilities like thermostats and smoke alarms often come to mind. Smart versions of these devices not only benefit consumers, but also help utility providers operate efficiently, resulting in savings for all parties.
Retail
Bringing radio-frequency identification (RFID) online allows retailers to implement just-in-time (JIT) stock-keeping to cut inventory costs. Additionally, retailers can provide better shopping experiences in the form of mobilized point-of-sale systems and contextually relevant offers.
Healthcare
Connected health equipment allows for real-time health monitoring and alerts that offer improved patient treatment, diagnosis, and awareness.
Transportation and Logistics
IoT is improving efficiency in transportation and logistics markets by providing benefits like just-in-time manufacturing and delivery, as well as improved customer service.
Automotive
The automobile industry is improving efficiencies through predictive maintenance and internet enabled fault diagnostics. Another interesting use case comes from capturing driving activity, as insurance companies can better predict driver risk and offer discounts (or premiums) based on data from the road.
Finding the Internet of Your Things
To take advantage of IoT, Matt notes that it is paramount to identify the top priorities for your specific case by asking the following questions:
Are there ‘things’ within your organization that would benefit from greater connectivity?
Can better use be made of the ‘things’ that are already network-ready and the data they create?
Are there ‘things’ outside the organization that would benefit from greater connectivity?
Is there a way to reap value from your customers, partners, or suppliers’ smart devices that would be mutually beneficial?
If you answered ‘yes’ to any of these questions, there is a good chance your organization can improve efficiency with an IoT solution. To get started, watch the recording of the IoT and multi-model data infrastructure webcast and view the slides here:
Read Post

Product
Why SingleStore Placed a Bet on SQL
In the technology industry, when products or innovations last for a long period of time, they are often here to stay. SQL is a great example of this – it has been around for over 30 years and is not going away anytime soon. When Eric Frenkiel and Nikita Shamgunov founded SingleStore in 2011, they were confident in choosing the SQL relational model as the foundation for their database. But the database industry during that era was clamoring around NoSQL, lauding it as the next great innovation, mostly on the themes of scalability and flexibility. When SingleStore graduated from Y Combinator, a prominent tech incubator, that same year it was the only distributed SQL database in a sea of non-SQL offerings.
SQL has since proven its ability to scale and meet today’s needs. Business analysts seek easy interfaces and analytics for the problems they are trying to solve. Customers want SQL, and like Dan McCaffrey, VP of Analytics at Teespring, happily cite that as a reason for choosing SingleStore. Dan states: “What I really liked about SingleStore was the ANSI SQL support for dynamic querying needs at scale, in a reliable, robust, easy-to-use database.”
Now, with the reconquista of SQL, we are seeing two funny things happening in the market.
One, companies that monetize the Hadoop Distributed File System are adding layers of SQL on top of the Hadoop platform. Two, NoSQL databases are incorporating SQL. NoSQL databases are essentially key value stores, and adding SQL gives them the ability to do some analytics. However, adding a SQL layer is no substitute for the richness of advanced SQL that was built into the SingleStore database. SQL as a layer is just a band-aid solution.
The Gartner Magic Quadrant for Operational Database Management Systems
The latest Gartner Magic Quadrant for Operational Database Management Systems confirms something we have been championing for a while:
“By 2017, all leading operational DBMSs will offer multiple data models, relational and NoSQL, in a single DBMS platform… by 2017, the “NoSQL label will cease to distinguish DBMSs, which will result in it falling out of use.”
For years, SingleStore has supported both a fully-relational SQL model and a “NoSQL” model, together in the same cluster of machines. This was a bet made by our original engineering team – they understood the powerful appeal of SQL to business users, but also knew the value of the “NoSQL” model for vast scale. For that reason, SingleStore is multi-model, and databases of the future will need to support multiple models to survive.
Our co-founders were confident back in 2011, and we remain confident with validation from the market, research firms like Gartner, and most importantly from our customers, that SQL is the path forward. We will continue to hone the SQL aspects of our database and champion the lingua franca of the database world.
Read Post

Data Intensity
The Benefits of an In-Memory Database
Our CTO and co-founder Nikita Shamgunov recently sat down with Software Engineering Daily. In the interview, Nikita focused on the ideal use cases for an in-memory database, compared to a disk-based store, and clarified how SingleStore compares to MySQL. In this post, we will dig deeper into how we define the ‘in-memory database’ and summarize its benefits.
What is SingleStore?
SingleStore is a high-performance in-memory database that combines the horizontal scalability of distributed systems with the familiarity of SQL.
How do you define an ‘in-memory database’?
An in-memory database, also known as a main memory database, can simply be defined as a database management system that depends on main memory (RAM) for computer data storage. This is in contrast to traditional database systems, which employ disk-based storage engines.
The term ‘in-memory’ is popular now, but it does not tell the whole story of SingleStore. Our preferred description is ‘memory first’, which means RAM is used as a first-class storage layer – you can read and write directly to and from memory without touching the disk. This is opposed to “memory only,” which does not incorporate disk as a complementary storage mechanism.
What type of data lends itself well to an in-memory database?
If you need ultra fast access to your data, store it in an in-memory database. Many real-time applications in the modern world need the power of in-memory database structures.
There are several critical features that set in-memory databases apart. First, all data is stored in main memory. Therefore, you will not have to wait for disk I/O in order to update or query data. Plus, data is loaded into main memory with the help of specialized indexing data structures. Second, data is always available in memory, but is also persisted to disk with logs and database snapshots. Finally, the ability to read and write data so quickly in an in-memory database enables mixed transaction/analytical and read/write workloads.
As you can see, in-memory databases provide important advantages. They have the potential to save you a significant amount of time and money in the long run.
For more information, listen to Nikita’s complete podcast:
http://traffic.libsyn.com/sedaily/memsql_nikita_2.mp3
Test the magic of in-memory for yourself, by downloading a 30-day Enterprise Edition trial or free forever Community Edition of SingleStore at singlestore.com/free.
Read Post

Engineering
Building an Infinitely Scalable Testing System
Quality needs to be architected like any other feature in enterprise software. At SingleStore, we build test systems so we can ship new releases as often as possible. In the software world, continuous testing allows you to make tiny changes along the way and keep innovating quickly. Such continuous testing is an essential task—and on top of that, we compete with large companies and their armies of manual testers. Instead of hiring hordes of testers, we decided to build infinitely scalable test software.
This test system is called Psyduck, and it is extremely powerful. We currently run over 100,000 tests a day on Psyduck, almost double the number of tests from the last release of SingleStore. In order to achieve this, we had to architect Psyduck to scale as we grew.
In this blog post, we will share how we utilize Psyduck to maximize product quality, as well as build an efficient developer workflow.
Any engineering team, regardless of size, needs an infinitely scalable testing system of its own.
Make Testing Easy
The first step in building your test system is to ensure your entire team is on board. You can make testing mandatory, but the best way to develop extraordinary testing is to make the process easy. This also helps foster an engineering culture where developers are passionate about testing. The SingleStore developer workflow for writing a new feature, including testing it, is engineered to be deliberately easy. See the image below.
Read Post

Product
Digital Ocean Tutorial Gets You Up and Running in Minutes
As fun as it is to squirrel around inside the guts of some new technology, it’s sometimes nice to follow a recipe and end up with something that Just Works. For years, Digital Ocean, an up-and-coming cloud provider, has been producing quality tutorials on how to set up cool software on their virtual machines. Today, Ian Hansen published an in-depth tutorial on setting up a three-node SingleStore cluster. Check it out here.
Go to the Digital Ocean tutorial and learn how to install SingleStore in minutes
Once the cluster is running, Ian walks through our DB Speed Test. He then dives into interacting with SingleStore using the stock MySQL client and handling structured and semi-structured data with our JSON datatype. The next tutorials in the series will deal with sharding strategies, replication, and security.
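For a flavor of what the JSON portion covers, here is a generic sketch of storing and reading a JSON column over the MySQL protocol. The table and field names are invented, and the field-extraction syntax SingleStore offers on top of this is covered in the tutorial and the docs.

```python
# Hypothetical sketch: storing semi-structured JSON alongside structured
# columns in SingleStore, using a standard MySQL driver.
import json
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=3306,
                       user="root", password="", database="tutorial")
with conn.cursor() as cur:
    cur.execute("""CREATE TABLE IF NOT EXISTS events (
                       id BIGINT PRIMARY KEY,
                       created DATETIME,
                       properties JSON)""")
    cur.execute("INSERT INTO events VALUES (%s, NOW(), %s)",
                (1, json.dumps({"page": "/pricing", "referrer": "google"})))
    cur.execute("SELECT properties FROM events WHERE id = 1")
    print(json.loads(cur.fetchone()[0]))   # the JSON comes back as a string
conn.commit()
conn.close()
```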
We’re also lucky to have Ian here at Strata / Hadoop World in NYC to give a talk called “Big Data for Small Teams”, about how Digital Ocean uses SingleStore to unify and analyze their clickstream data with a minimum of fuss.
Read Post

Data Intensity
Making Faster Decisions with Real-Time Data Pipelines
Read Post

Data Intensity
Build Real-Time Data Pipelines with SingleStore Streamliner
SingleStore Streamliner is now generally available! Streamliner is an integrated SingleStore and Apache Spark solution for streaming data from real-time data sources, such as sensors, IoT devices, transactions, application data and logs.
The SingleStore database pairs perfectly with Apache Spark out-of-the-box. Apache Spark is a distributed, in-memory data processing framework that provides programmatic libraries for users to work with data across a broad set of use cases, including streaming, machine learning, and graph data processing. SingleStore and Spark share many design principles: they are in-memory, distributed, and data-centric. Spark provides an amazing interface to the unique functionality in SingleStore: fast and durable transactions, a real-time hybrid row/column-oriented analytics engine, and a highly concurrent environment for serving complex SQL queries.
The SingleStore Spark Connector, released earlier this year, allows Spark and SingleStore integration, facilitating bi-directional data movement between Spark and SingleStore. The connector generated a lot of interest from users who saw the benefits of using Spark for data transformation and SingleStore for data persistence. A consistent theme in the use cases we saw was the desire to use Spark to stream data into SingleStore with Spark Streaming. SingleStore Streamliner is the result of our work to productize this workflow into an easy, UI-driven tool that makes this process dead simple.
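The connector provides its own Spark integration, which is the recommended path; purely as an illustration of the flow, a DataFrame can also be pushed to a MySQL-protocol endpoint such as SingleStore over plain JDBC, as in the sketch below. The endpoint and credentials are placeholders, and the MySQL JDBC driver must be on the Spark classpath.

```python
# Generic JDBC write from Spark to a MySQL-protocol endpoint such as
# SingleStore. The SingleStore Spark Connector exposes its own data source
# with better performance; this only illustrates the Spark-to-database flow.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-example").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://127.0.0.1:3306/demo")   # placeholder endpoint
   .option("dbtable", "spark_out")
   .option("user", "root")
   .option("password", "")
   .mode("append")
   .save())
```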
Let’s review the thinking behind some of the decisions we made as we were developing Streamliner.
Early work with Pinterest
Pinterest showcased a Kafka+Spark+SingleStore solution at Strata+Hadoop World last February, which was a collaborative effort with SingleStore. See the Pinterest blog post and the Pinterest demo to learn more. The Pinterest solution leveraged Spark Streaming to quickly ingest and enrich data from Kafka, and then store it in SingleStore for analysis.
Read Post

Trending
Rapid Scaling for Startups: Lessons from Salesforce and Twitter
RSVP for the SingleStore Meetup: 10 Tips to Rapidly Scale Your Startup with Chris Fry, former SVP of Engineering at Twitter
There is nothing more challenging and exciting than experiencing hyper growth at a technology company. As users adopt a technology platform, you have to rebuild the plane while flying it, which can be a harrowing process. I found several approaches to scaling that held true across Salesforce, Twitter, and the startups I now work with on a daily basis. Every company is different, but these common problems and solutions should help you on your journey.
Team Structure
The first problem most early stage companies face is how to grow and structure the team. There are common breaking points around 20 people and again around 150, where what you were doing ceases to function. What should your teams look like while you are simultaneously tackling growth, structure, and scale?
Small teams are the most effective teams, with an ideal size between two and ten people (with smaller being better). Large teams don’t stay in sync while small teams can organically communicate, solve problems and fill in for teammates. You can decompose large teams into autonomous small teams.
The best teams can work autonomously. Make sure that teams have all the resources needed to deliver on their goals and know the boundaries of what they should take on. It’s important to create teams that span technology horizontally to create consistency and vertically to attack user focused problems. Teams need structure so they can deliver on their mission without other teams getting in the way.
Fast Iteration
How do you keep delivering as your company scales? Early in a technology company’s life, many teams naturally iterate quickly. Unfortunately, as companies scale, communication and technology issues slow iteration speed. The best thing to focus on is delivering work at a regular, quick pace. Creating software is a learning process and each iteration is a chance for the team to learn.
Automation also plays a critical role in maintaining a high quality product and should be developed while you are building features. Remember, quality is free – the better software you build, the more you test it, the faster you can change it.
Retention and Culture
How do you build and maintain a unique engineering culture?
To scale an engineering culture you must have one. Discuss it. Set principles. Teach the team to easily remember and articulate key cultural tenets. Put these tenets in writing to bring on new employees and serve as a reference point. Finally, live the culture you set. Culture is a soft topic, and if it’s not lived from the top it is just words on paper. To steal from Dan Pink, I would always focus on delivering autonomy, mastery, and purpose to each engineer and the engineering team as a whole, and build out the cultural practices from there. For example, a hack week, or letting people pick which team they work on each quarter.
For example, at both Salesforce and Twitter we stressed a culture of experimentation and learning. This helped us focus on product and technology innovation and led directly to better product features for our primary platforms. It’s important to invest in the technical infrastructure to support iteration. At Twitter we used Mesos to scale computation and built out distributed storage to make data available anywhere it was needed. Your infrastructure should allow any engineer to put an idea into production in a day.
Learn More Scaling Tips
Chris will be presenting “10 Tips to Rapidly Scale Your Startup” on Thursday evening September 24th at SingleStore headquarters in San Francisco. Visit http://www.meetup.com/memsql to register.
About Chris Fry
Chris Fry was Senior Vice President of Engineering at Twitter, Inc. and before that Senior Vice President, Development, at Salesforce. He is currently an advisor to SingleStore and other startups.
Read Post

Trending
5 Big Data Themes – Live from the Show Floor
We spent last week at the Big Data Innovation Summit in Boston. Big data trade shows, particularly those mixed with sophisticated practitioners and people seeking new solutions, are always a perfect opportunity to take a market pulse.
Here are the five big data themes we encountered over the course of two days.
Real-Time Over Resuscitated Data
The action is in real time, and trade show discussions often gravitate to deriving immediate value from real-time data. All of the megatrends apply… social, mobile, IoT, cloud, pushing startups and global companies to operate instantly in a digital, connected world.
While there has been some interest in resuscitating data from Hadoop with MapReduce or SQL on Hadoop, those directions are changing. For example, Cloudera recently announced the One Data Platform Initiative, indicating a shift from MapReduce:
this initiative will enable [Spark] to become the successor to Hadoop’s original MapReduce framework for general Hadoop data processing
With Spark’s capabilities for streaming and in-memory processing, we are likely to see a focus on those real-time workflows. This is not to say that Spark won’t be used to explore expansive historical data throughout Hadoop clusters.
But judge your own predilection for real-time and historical data. Yes, both are important, but human beings tend to have an insatiable desire for the now.
Data Warehousing is Poised for Refresh
When the last wave of data warehousing innovation hit mainstream, there was a data M&A spree that started with SAP’s acquisition of Sybase in May 2010. Within 10 months, Greenplum was acquired by EMC, Netezza by IBM, Vertica by HP, and Aster by Teradata.
Today, customers are suffering economically with these systems, which have become expensive to maintain and do not deliver the instant results companies now expect.
Applications like real-time dashboards push conventional data warehousing systems beyond their comfort zone, and companies are seeking alternatives.
Getting to ETL Zero
If there is a common enemy in the data market, it is ETL, or the Extract, Transform, and Load process. We were reminded of this when Riley Newman from Airbnb mentioned that
ETL was like extracting teeth…no one wanted to do it.
Ultimately, Riley did find a way to get it done by shifting ETL from a data science to a data engineering function (see final theme below), but I have yet to meet a person who is happy with ETL in their data pipeline.
ETL pain is driving new solution categories like Hybrid Transactional and Analytical Processing, or HTAP for short. In HTAP solutions, transactions and analytics converge on a single data set, often enabled by in-memory computing. HTAP capabilities are at the forefront of new digital applications with situational awareness and real-time interaction.
The Matrix Dashboard is Coming
Of course, all of these real-time solutions need dashboards, and dashboards need to be seen. Hiperwall makes a helpful solution to tie multiple monitors together in a single, highly-configurable screen. The dashboards of the future are here!
Read Post

Trending
Incumbents and Contenders in the $33B Database Market
The database market continues to surprise those of us who have been in it for a while. After the initial wave of consolidation in the late 1990s and early 2000s, the market has exploded with new entrants: column-stores, document databases, NoSQL, in-memory, graph databases, and more. But who will truly challenge the incumbents for a position in the Top 5 rankings? Oracle, IBM, Microsoft, SAP, and Teradata dominate the $33B database market. Will it be a NoSQL database? Will it be an open source business model?
Ripping and replacing existing databases has been described as heart and brain surgery – at the same time. As such, new entrants must find new use cases to gain traction in the market. In addition, the new use cases must be of enough value to warrant adding a new database to the list of approved vendors. Splitting the world roughly into analytic use cases and operational use cases, we have seen a number of different vendors come and go without seriously disrupting the status quo. Part of the problem appears to be the strategy of using open source as a way to unseat the established vendors. While people seem willing to at least try free software (especially for new use cases), is it a sustainable business model?
The open-source market is growing rapidly. However, it is still less than 2% of the total commercial database market. Gartner’s latest numbers show the open-source database market at only $562M, and the total commercial database market at $33B, in 2014.
Furthermore, databases are complex, carrying decades of history behind them. To match, and ultimately exceed, incumbent offerings, the key is not to have armies of contributors working in individual lanes, but rather to have a focused effort on the features that matter most for today’s critical workloads. This is especially true with the increasing number of mixed analytical and transactional use cases driven by the new real-time, digital economy. In the case of MySQL, the most successful open source database product, less than 1% of the installed base pays anything. Monty Widenius, the creator of MySQL, himself pointed this out in a famous post a couple of years ago.
The business model needs to make sense too. The open source world almost never subtracts, it adds: more components, more configurations, more scratches for individual itches. Witness the explosion of projects in the Hadoop ecosystem, and the amount of associated services revenue. A commercial model embeds features into the primary product, efficiently generating value. Today customers seek to consolidate the plethora of extensive data processing tools into fewer multi-model databases.
So, it is likely that the next vendor to win a spot in database history will do so by winning on features and workload applicability, and a proven business model with a primary product roadmap.
However, there are many compelling aspects of the open source model, with three core value propositions: (1) a functional, free version; (2) open-source at the “edges” of the product; and (3) a vibrant community around the product. How can a commercial vendor balance both worlds?
Companies pursuing these strategies include MapR in the Hadoop space. With announcements earlier this summer, SingleStore appears to be heading there too, for operational and analytical databases. They now have a SingleStore Community Edition with unlimited size and scale, and full access to core database features. While the production version of the product requires a paid license, this seems to be a reasonable way to balance the need to support a growing, focused engineering team with core value propositions of an open-source model.
So, the question remains: as the database wars heat up and the market gets crowded, who will prevail to lead the industry? With open-source becoming more mainstream, the true contenders will be the vendors that can offer a symmetry between open-source models and new critical workload features.
Read Post

Trending
Join SingleStore in Boston for Big Data Innovation Summit
The Big Data Innovation Summit kicks off in Boston today, uniting some of the biggest data-driven brands, like Nike, Uber, and Airbnb. The conference is an opportunity for industry leaders to share diverse big data initiatives and learn how to approach prominent data challenges.
We are exhibiting at booth #23 and will showcase several demos: MemCity, Supercar, and Real-time Analytics for Pinterest. On top of that, we will have games and giveaways at the booth, as well as a complimentary download of the latest Forrester Wave report on in-memory database platforms. More on what to expect:
Demos
MemCity – a simulation that measures and maps the energy consumption across 1.4 million households in a futuristic city, approximately the size of Chicago. MemCity is made possible through a real-time data pipeline built from Apache Kafka, Apache Spark, and SingleStore.
Supercar – showcases the real-time geospatial intelligence features of SingleStore. The demo is built on a dataset containing the details of 170 million real-world taxi rides. Supercar allows users to select a variety of queries to run on the ride data, such as the average trip length during a given time window. The real-world application: business or traffic analysts can monitor activity across hundreds of thousands of vehicles and identify critical metrics, like how many rides were served and the average trip time.
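For a sense of what one of those analyst queries looks like, here is a simplified sketch of the "rides and average trip time in a window" question. The table name and schema are hypothetical, not the demo's actual schema.

```python
# Hypothetical analyst query against a rides table like the one behind the
# Supercar demo; the table name and columns are made up for illustration.
import pymysql

SQL = """
    SELECT COUNT(*) AS rides,
           AVG(TIMESTAMPDIFF(SECOND, pickup_time, dropoff_time)) AS avg_trip_seconds
    FROM rides
    WHERE pickup_time BETWEEN %s AND %s
"""

conn = pymysql.connect(host="127.0.0.1", port=3306,
                       user="root", password="", database="supercar")
with conn.cursor() as cur:
    cur.execute(SQL, ("2015-08-01 17:00:00", "2015-08-01 18:00:00"))
    print(cur.fetchone())   # (ride count, average trip time in seconds)
conn.close()
```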
Read Post

Trending
Locate This! The Battle for App-specific Maps
In early August, a consortium of the largest German automakers including Audi, BMW, and Daimler (Mercedes) purchased Nokia’s Here mapping unit, the largest competitor to Google Maps, for $3 billion.
It is no longer easy to get lost. Quite the opposite: we expect and rely on maps for our most common Internet tasks, from basic directions to on-demand transportation, discovering a new restaurant, or finding a new friend.
And the battle is on between the biggest public and private companies in the world to shore up mapping data and geo-savvy engineering talent. From there, the race continues to deliver the best mapping apps.
Recently, a story on the talent war among unicorn private companies noted:
Amid a general scramble for talent, Google, the Internet search company, has undergone specific raids from unicorns for engineers who specialize in crucial technologies like mapping.
Wrapping our planet in mobile devices gave birth to a new geographic landscape, one where location meets commerce and maps play a critical role. In addition to automakers like the German consortium having a stake in owning and controlling mapping data and driver user experiences, the largest private companies like Uber and Airbnb depend on maps as an integral part of their applications.
That is part of the reason purveyors of custom maps like Mapbox have emerged to handle mapping applications for companies like Foursquare, Pinterest, and Mapquest. Mapbox raised $52.6 million earlier this summer to continue its quest.
Mapbox and many others in the industry have benefitted from the data provided by Open Street Maps, a collection of mapping data free to use under an open license. Of course some of the largest technology companies in the world besides Google maintain their own mapping units including Microsoft (Bing Maps) and Apple Maps.
Investment in the Internet of Things combined with mobile device proliferation is creating a perfect storm of geolocation information to be captured and put to use. Much of this will require an analytics infrastructure with geospatial intelligence to realize its value.
In a post titled Add Location to Your Analytics, Gartner notes:
The Internet of Things (IoT) and digital business will produce an unprecedented amount of location-referenced data, particularly as 25 billion devices become connected by 2020, according to Gartner estimates.
and more specifically:
Dynamic use cases require a significantly different technology that is able to handle the spatial processing and analytics in (near) real time.
Of course, geospatial solutions have been around for some time, and database providers often partner with the largest private geospatial company, Esri, to bring them to market. In particular, companies developing in-memory databases like SAP and SingleStore have showcased work with Esri. By combining the best in geospatial functions with real-time, in-memory performance, application makers can deliver app-specific maps with an unprecedented level of consumer interaction.
Google’s balloons and Facebook’s solar powered drones may soon eliminate the dead zones from our planet, perhaps removing the word “lost” from our vocabulary entirely. Similarly, improvements in interior mapping technology guarantee location specific details down to meters. As we head to this near-certain future, maps, and the rich, contextual information they provide, appear to be a secret weapon to delivering breakout application experiences.
Download SingleStore today to try a real-time database with native geospatial intelligence at: singlestore.com/free.
Read Post

Product
Understanding SingleStore in 5 Easy Questions
The funny thing about SingleStore is that it is simultaneously familiar and leading edge. On one hand, it is a relational database…
Read Post

Data Intensity
In-Memory Database Survey Reveals Top Use Case: Real-Time Analytics
To shed light on the state of the in-memory database market, we conducted a survey on the prevalent use cases for in-memory databases. Respondents included software architects, developers, enterprise executives and data scientists. The results revealed a high demand for real-time capabilities, such as analytics and data capture, as well as a high level of interest in Spark Streaming.
Real-Time Needs for In-Memory Databases
It is no surprise that our survey results highlight real-time analytics as the top use case for in-memory databases. For years, big data was heralded as the future of technology – today, it is a reality for companies big and small. Going real-time is the next phase for big data, and people seek technologies that address real-time data needs above all else. Those who can successfully converge transactional and analytical data processing see greater efficiency in data management and have an invaluable advantage over their competitors.
Read Post

Engineering
Making Painless Schema Changes
The ability to change a table’s schema without downtime in production is a critical feature of any database system. In spite of this, many traditional relational databases have poor support for it. Quick and easy schema changes were a key advantage of early distributed NoSQL systems, but of course, those systems jettison relational capabilities.
Though conventional wisdom may indicate otherwise, easy schema changes are possible with the relational model. At SingleStore we put careful thought and effort into making sure that ALTER TABLE operations have minimal impact on running workloads. This feature is commonly called an “online” ALTER TABLE. Most relational databases support the notion of an “online” ALTER TABLE, but every vendor has a different definition of what that means. In SingleStore we define a true online ALTER as one that:
1) Does not require doubling the disk or memory use of the table while executing (creating a 2nd copy of the table without destroying the original table is not allowed)
2) Does not lock the table or prevent querying it for long periods of time (read or write) while running (under a second of blocking queries is OK)
3) Does not use excessive system resources while running (CPU, Disk, Network) no matter the size of the table or the workload running against the table
SingleStore is the only distributed relational database able to achieve all three. For example, MySQL Cluster fails to do (1) – it copies the table in many cases. VoltDB, Vertica, and Redshift fail to do (2) – they lock the table throughout the entire ALTER operation, effectively taking down your production system, or requiring tedious juggling of replicas.
Explaining how our ALTER TABLE works is best done by stepping through an example. Let’s say we wanted to add a column to a table as follows:
CREATE TABLE example(c1 int primary key);
ALTER TABLE example ADD COLUMN c2 VARCHAR(100) DEFAULT NULL;
Consider this diagram while we outline how ALTER runs through four phases of execution in the SingleStore rowstore.
Read Post

Engineering
How to Write Compilers in Modern C++ – Meetup with Drew Paroski
Visit our SoMa headquarters this Wednesday, August 19th for our third official meetup, from 6pm-8pm! This is an exclusive opportunity to learn the art of building compilers from Drew Paroski. Before joining SingleStore, Drew co-created the HipHop Virtual Machine (HHVM) and Hack programming language to support Facebook’s web scale across a growing user base in the billions. Read more about Drew here: http://blog.memsql.com/creator-of-hhvm-joins-memsql/. We will have a delicious Mexican feast complete with appetizers, south of the border brews, and wine.
Compilers maximize application performance by translating a given programming language, like C++, into machine code. The ideal compiler produces very efficient machine code for popular programming languages, which means that programs written in the source language (e-commerce websites, games, social networking sites, you name it) will be able to execute 2x, 5x, 10x faster. Compilers represent a single piece of software that can speed up all kinds of applications.
Drew’s expertise includes computer performance, programming with big data, and the advancement of compilers over the past 20 years. At the meetup, he will outline key considerations for building the best possible compiler, including:
identifying your performance goals
evaluating a full-custom approach versus alternatives
developing measurement benchmarks
Read Post