Author

Gary Orenstein
Former Chief Marketing Officer, SingleStore

Trending
Matching Modern Databases with ML and AI
Introduction
Machine Learning (ML) and Artificial Intelligence (AI) have stirred the technology sector into a flurry of activity over the past couple of years. However, it is important to remember that it all comes back to data. As Hilary Mason, a prominent data scientist, noted in Harvard Business Review,
…you can’t do AI without machine learning. You also can’t do machine learning without analytics, and you can’t do analytics without data infrastructure.
Over the last year we assembled a number of blogs, videos, and presentations on using ML and AI with SingleStore.
SingleStore with ML and AI
As a foundational datastore, SingleStore incorporates machine learning functions in one of three ways:
– Calculations outside the database
– Calculations on ingest
– Calculations within the database
Read an overall summary in our blog post Machine Learning and SingleStore from Rick Negrin.
Outside the Database
For integrating ML and AI outside the database, two popular methods are integrating with Spark and TensorFlow.
For Spark, SingleStore offers an open source SingleStore Spark Connector, which delivers high-throughput, bi-directional, and highly parallel operations from partition to partition. This connector opens up unlimited ML and AI possibilities that can be combined with a scalable, durable datastore from SingleStore.
One example of this integration is real-time machine learning scoring. A stereotypical pipeline might be:
`IIoT Collection > Kafka > Spark > SingleStore > Queries`
For example, some popular statistical software packages allow export as PMML, the Predictive Model Markup Language. These or similar models can be exported into Spark, or even the database itself, to score incoming data in real time. From there, the data point and the model’s score for that data point can be persisted together for easy analysis.
SingleStore PowerStream is a specific example of combining machine learning with a database to score data in real time and predict the likelihood of an equipment failure.
An interactive demonstration using SingleStore PowerStream simulates 200,000 wind turbines sending sensor information at a rate of approximately 2 million inserts per second to SingleStore. From there, the user interface shows live status on real-time information, and ML scoring predicts the likelihood of turbine failures.
Read more in our blog IoT at Global Scale: PowerStream Wind Farm Analytics with Spark.
TensorFlow, and the results of applying machine learning models in TensorFlow, are shown in this video Scoring Machine Learning Models at Scale from Strata New York.
TensorFlow is often used in real-time image recognition. This presentation and video from Spark Summit Dublin 2017 highlights this popular use case. And here’s a relevant course on TensorFlow from a provider focused on data science, AI, and machine learning: Artificial Intelligence Course and Training.
ML On Ingest
Another place to run machine learning is as data arrives in the database. SingleStore enables this with native pipelines from Kafka, including exactly-once semantics, a critical capability for deduplication in event-driven pipelines. Further, the pipeline ingest capability includes the option to execute a custom transformation, as sketched below.
A typical real-time data pipeline in this scenario might be:
`Kafka > SingleStore > Query/Visualization`
SingleStore UI for Custom Transformations
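To make the custom transformation option concrete, here is a minimal sketch of a pipeline pulling from Kafka and running a scoring transform on each batch. The Kafka endpoint, table, and transform archive are hypothetical placeholders, and the exact clause syntax may vary by SingleStore version.

-- Hypothetical target table for scored sensor events
CREATE TABLE sensor_scores (
  sensor_id BIGINT,
  reading DOUBLE,
  score DOUBLE,
  event_time DATETIME
);

-- Pipeline from Kafka with a custom transform applied to each batch
CREATE PIPELINE sensor_scoring AS
  LOAD DATA KAFKA 'kafka-broker:9092/sensor-topic'
  WITH TRANSFORM ('http://example.com/score_transform.tar.gz', 'score.py', '')
  INTO TABLE sensor_scores
  FIELDS TERMINATED BY ',';

START PIPELINE sensor_scoring;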
Read Post
Data Intensity
Visual Takeaways from Gartner Data and Analytics 2018
We attended the Gartner Data and Analytics Summit in Grapevine, Texas in early March. This series is part of its global events schedule and similar conferences happen around the world.
One fun hallway display was a series of animated summaries with key themes, tracks, and sessions of the conference. They were created by Katherine Torrini at Creative Catalyst.
The first display covered the primary show themes of scaling the value of data and analytics:
– Establishing Trust in the data foundation
– Promote a culture of Diversity
– Building the Data Literacy of your workforce
– Mastering Complexity of running a digital business
#GartnerDA keynote: The value of data & #analytics is based on mastering #trust #diversity #literacy and #complexity. pic.twitter.com/dQlTiaHFiQ
— Doug Laney (@Doug_Laney) March 5, 2018
Read Post

Trending
Just Like IoT, Enterprises Have Been Using AI All Along
Just like IoT, which has been around in various forms for years, AI has been prevalent in nearly every enterprise for decades. The catch? It was hidden inside the databases and data warehouses in use, under the banner of Query Optimization.
Query optimization is the process where a database or data warehouse takes a user’s question as input and reworks it to deliver a quick response with as few compute resources as possible. The number of query plan choices reaches far beyond what a human can calculate. An algorithm sifting through the choices for the best plan makes applications faster, since they receive data more quickly.
Let’s take a closer look at the basics of query optimization, how it aligns with the principles of artificial intelligence, and the benefits delivered to enterprises.
Understanding Query Optimization
When an application or analyst requests information from a database or data warehouse, the question is phrased as a query, most frequently in SQL, the Structured Query Language.
An extremely simple query might be the following:
SELECT * FROM Customers
WHERE Country='Canada';
This simple question requires searching everything in the Customers table to find results where the Country equals Canada.
But SQL is a rich language that offers many permutations of how users or applications can write or assemble queries. Turning the query into a function the database can execute is the core of query optimization. And in general, there is a massive space of possible plans that can answer a specific question asked by a user.
A more complex query might be the following query, #17 from the well-known TPC-H benchmark. According to the benchmark report:
This query determines how much average yearly revenue would be lost if orders were no longer filled for small quantities of certain parts. This may reduce overhead expenses by concentrating sales on larger shipments.
Answering the following business question:
The Small-Quantity-Order Revenue Query considers parts of a given brand and with a given container type and determines the average lineitem quantity of such parts ordered for all orders (past and pending) in the 7-year database. What would be the average yearly gross (undiscounted) loss in revenue if orders for these parts with a quantity of less than 20% of this average were no longer taken?
The query is expressed as:
SELECT Sum(l_extendedprice) / 7.0 AS avg_yearly
FROM   lineitem,
       part
WHERE  p_partkey = l_partkey
       AND p_brand = 'Brand#43'
       AND p_container = 'LG PACK'
       AND l_quantity < (SELECT 0.2 * Avg(l_quantity)
                         FROM   lineitem
                         WHERE  l_partkey = p_partkey);
Given the quantity section of the query:
SELECT 0.2 * Avg(l_quantity)
FROM   lineitem
WHERE  l_partkey = p_partkey
the system would have to run this SELECT to compute the average quantity ordered for each part individually.
As an example, a query optimizer can rewrite this correlated subselect to the following:
SELECT Sum(l_extendedprice) / 7.0 AS avg_yearly
FROM   lineitem,
       (SELECT 0.2 * Avg(l_quantity) AS s_avg,
               l_partkey AS s_partkey
        FROM   lineitem,
               part
        WHERE  p_brand = 'Brand#43'
               AND p_container = 'LG PACK'
               AND p_partkey = l_partkey
        GROUP  BY l_partkey) sub
WHERE  s_partkey = l_partkey
       AND l_quantity < s_avg;
This speeds up the query by requiring fewer scans of the `lineitem` table. In the original query, the computation is needed for every distinct part, but with the query optimization rewrite, we compute all of the averages at once for the different parts and then use those results to complete the query. More information on the technical details of query optimization is available in this paper from the Very Large Data Bases (VLDB) Conference.
So the query optimizer will make decisions based on the query and how the data is located in the database across tables and columns. There are standard rules that apply for certain query shapes, but very few cases where the query plan chosen is always the same.
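To see which plan the optimizer actually chose, most SQL databases, SingleStore included, expose the plan through EXPLAIN. A quick check on the TPC-H query above might look like the sketch below; the output format varies by database.

-- Show the plan chosen for Q17, including join order and whether the
-- correlated subselect was decorrelated into a group-by.
EXPLAIN
SELECT Sum(l_extendedprice) / 7.0 AS avg_yearly
FROM   lineitem,
       part
WHERE  p_partkey = l_partkey
       AND p_brand = 'Brand#43'
       AND p_container = 'LG PACK'
       AND l_quantity < (SELECT 0.2 * Avg(l_quantity)
                         FROM   lineitem
                         WHERE  l_partkey = p_partkey);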
World-class query optimization generally relies on three things.
Cardinality Estimates
Cardinality estimates are predictions such as how many rows will match from each table in the query and, once the tables are joined, how large the intermediate result set will be. Query optimizers use this information to estimate the costs of potential query plans.
Costing
Costing simulates the expense of each algorithm choice and picks the one it thinks is best. You need to consider and cost alternative query plans to pick the most appropriate one. Heuristics and rules can point in certain directions, but as with many things, past performance is not a predictor of future results. The statistics are never perfect, so the query optimization process inevitably involves missed estimates and mistaken guesses.
In general, query optimization is trying to make the best choice, but the goal is not to pick the single best plan, as much as it is to pick among the best plans so that the downside is limited. Limitations are a fact of life as the statistics will never be perfectly predictive of query results. This process closely resembles many AI workflows in use elsewhere.
In thinking about estimates and costing, remember that the estimates do not take the data itself into account. An estimate on one database would be just as valid as an estimate on another database, whereas the cost represents how expensive a query would be to execute on a specific dataset and potentially a specific cluster configuration.
Data Distribution
Query optimizers need to understand the data distribution in order to figure out how the query plan will work. SQL provides a declarative description of the desired results, but it is up to the database to figure out how to orchestrate the query. Understanding data placement in the system makes this possible. The process is similar to the classic traveling salesperson dilemma of wanting to visit several customers in one day and finding the most efficient route between sites: you need a map and a way to measure distance to make an accurate assessment.
How Query Optimization Aligns with AI
Many AI workflows are based on analyzing large quantities of information with algorithms to search for patterns. Query optimization is similar with a very focused approach of calculating predictions for how the system can and should run a particular query.
A query does not have to be complex for there to be a million ways to execute it. Sifting through these combinations requires large amounts of compute power and sophisticated algorithms to find the best outcomes.
Query optimization studies the data in your tables and learns from it to make predictions about your query. It then uses those predictions to identify the most efficient ways to run operations. Even when query optimization gets a prediction wrong, it improves over time by collecting more statistics to provide more context. This approach also follows the basic patterns of AI.
Benefits of Query Optimization for Enterprises
Query optimization directly benefits enterprises running data-rich applications in the following ways.
Faster results for better performance
Applications that can access query results quickly allow businesses to operate in real time. Queries that may have taken hours can complete in minutes, and queries that may have taken minutes can complete in seconds. With a powerful query processing engine, many queries can be returned in well under one second. All of this allows businesses to react faster with better decision making.
Fewer compute resources for reduced costs
Well-constructed query plans, generated through optimization, consume fewer compute resources, allowing companies to process more data at a lower cost.
Support more users with greater efficiency
With powerful query optimization, systems can support more users at the same time. This may mean a large number of internal analysts accessing a single well-curated dataset, or keeping up with thousands or millions of end users interacting with a mobile application simultaneously.
Understanding Query Optimization and Distributed Systems
In a new world of distributed systems, query optimization becomes even more important.
Traditional databases had cost models that involved both CPU and disk I/O use. But in newer distributed systems, an additional important metric is network usage, and how much processing will take place across system nodes, instead of within a single node’s processor and disk.
Different join algorithm choices affect network utilization. For example, to evaluate a join between two tables distributed among multiple nodes, the optimizer must choose between broadcasting one of the tables to the other nodes and joining the tables directly on the nodes where the data lives. All of this has a direct impact on the network.
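As an illustration of why data placement matters, consider two hypothetical tables sharded on the join key. This is only a sketch using SingleStore-style SHARD KEY syntax; the tables and columns are illustrative.

-- Both tables are sharded on customer_id, so each node holds the
-- matching partitions of both tables.
CREATE TABLE customers (
  customer_id BIGINT NOT NULL,
  country VARCHAR(64),
  SHARD KEY (customer_id)
);

CREATE TABLE orders (
  order_id BIGINT NOT NULL,
  customer_id BIGINT NOT NULL,
  total DECIMAL(18,2),
  SHARD KEY (customer_id)
);

-- A join on the shard key can be evaluated locally on every node;
-- a join on any other column would force a broadcast or reshuffle
-- of one table across the network.
SELECT c.country, SUM(o.total) AS revenue
FROM   orders o
JOIN   customers c ON o.customer_id = c.customer_id
GROUP  BY c.country;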
The Future of Query Optimization and AI
We are likely to see continued use of AI techniques in query optimization for years to come. Today, some databases will explore or run parts of a query, observe the results, and then make a dynamic choice of plan. This puts more “smarts” into query execution: when the estimates are uncertain, the system can try two candidate plans and pick one, or run pieces of the query and decide between option A and option B.
Research papers explore how to use machine learning in the database to optimize data layout. TensorFlow can be used to run experiments and adapt.
If the cost model adapts based on experiments, it is easy to envision it adjusting in real time, with statistics learned from the data continually improving it.
Although a lot of work goes into cost models and statistics, every database workload is different. With the right feedback loop, and with every query acting as another experiment, the database can improve itself over time. However, one company’s database will improve itself over time for THAT company’s queries.
Long term there will continue to be millions, perhaps billions or trillions, of choices for query plans. This mandates an intelligent scheme to explore the entire space, meaning that machine learning and AI techniques will be a large part of query optimization well into the future.
A special thanks to Robert Walzer, Nick Kline, Adam Prout, and Jack Chen from SingleStore engineering for input on this blog post. If you’re interested in joining our team, check out our open positions.
Read Post

Data Intensity
Using SingleStore within the AWS Ecosystem
The database market is large and filled with many solutions. In this post, we will take a look at what is happening within AWS, the overall data landscape, and how customers can benefit from using SingleStore within the AWS ecosystem.
Understanding the AWS Juggernaut
At AWS re:Invent in December 2017, AWS CEO Andy Jassy revealed that the business is at a revenue run rate of $18 billion, growing 42 percent per year. Those numbers are staggering and showcase how important Amazon Web Services has become in the technology plans of nearly every major corporation.
Along with eye-popping revenue numbers, AWS has continued to offer an unprecedented number of new features every year. At the conference, AWS claimed it had shipped 3,951 new features and services in the five years since the first keynote address by CTO Werner Vogels.
Read Post

Trending
Top 5 2017 SingleStore Blog Posts
2017 was a good year for the database market as developers, architects, and business leaders explored what is possible with new platforms.
We covered several hot topics on the SingleStore blog including the integration with microservices, recognition by industry analysts, machine learning and image recognition, real-time geospatial analytics, and multi-tenancy in a cloud world.
Here are a few of our favorite posts from 2017.
SingleStore Meets the Microservices Architecture
Link to post
Microservices captured application developer attention in 2017 and our own Dale Deloy shared his take on how SingleStore fits this ecosystem.
Each application service component has its own independent service deployment and flexible data model. The goal is to allow developers to focus on building business logic rather than coordinating with corporate models and infrastructure constraints.
Gartner Magic Quadrant for Data Management Solutions for Analytics
Link to post
This Magic Quadrant showcases leaders in data warehousing, and SingleStore placed as a challenger, recognized in particular for its operational data warehouse capabilities.
Read Post
Product
SingleStoreDB Self-Managed 6 Product Pillars and Machine Learning Approach
Want to try SingleStoreDB Self-Managed 6? Click here to get started. Prefer the Release Notes? They are here.
Today marks another milestone for SingleStore as we share the details of our latest release, SingleStoreDB Self-Managed 6. This release encapsulates over one year of extensive development to continue making SingleStore the best database platform for real-time analytics with a focus on real-time data warehouse use cases.
Additionally, SingleStoreDB Self-Managed 6 brings a range of new machine learning capabilities to SingleStore, closing the gap between data science and operational applications.
Product Pillars
SingleStoreDB Self-Managed 6 has three foundational pillars:
– Extensibility
– Query Performance
– Enhanced Online Operations
Let’s explore each of these in detail.
Extensibility
Extensibility covers the world of stored procedures, user defined functions (UDFs), and user defined aggregates (UDAs). Together these capabilities represent a mechanism for SingleStore to offer in-database functions that provide powerful custom processing.
For those familiar with other databases, you may know of PL/SQL (Procedural Language/Structured Query Language) developed by Oracle, or T-SQL (Transact-SQL) jointly developed by Sybase and Microsoft. SingleStore has developed its own approach to offering similar functions with MPSQL (Massively Parallel Structured Query Language).
MPSQL takes advantage of the new code generation that was implemented in SingleStoreDB Self-Managed 5. Essentially, we use that code generation to compile MPSQL functions: stored procedures, UDFs, and UDAs are compiled to native machine code and inlined into the compiled code we generate for a query.
Long story short, we expect MPSQL to provide a level of peak performance not previously seen with other databases’ custom functions.
SingleStore extensibility functions are also aware of our distributed system architecture. This innovation allows for custom functions to be executed in parallel across a distributed system, further enhancing overall performance.
Benefits of extensibility include the ability to centralize processing in the database across multiple applications, the performance of embedded functions, and the potential to create new machine learning functions as detailed later in this post.
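As a flavor of what extensibility looks like in practice, here is a minimal sketch of a scalar UDF. The function name and body are hypothetical, and the exact MPSQL syntax may vary by release.

DELIMITER //

-- A simple scoring helper that, once compiled, runs in parallel
-- across partitions like any built-in function.
CREATE OR REPLACE FUNCTION sigmoid(x DOUBLE) RETURNS DOUBLE AS
BEGIN
  RETURN 1.0 / (1.0 + EXP(-x));
END //

DELIMITER ;

-- Example usage against a hypothetical accounts table:
-- SELECT account_id, sigmoid(risk_score) AS churn_probability FROM accounts;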
Query Processing Performance
SingleStoreDB Self-Managed 6 includes breakthrough improvements in query processing. One area of improvement is operations on encoded data. SingleStoreDB Self-Managed 6 includes dictionary encoding, which translates data into highly compressed unique values that can then be scanned incredibly fast.
Consider the example of a public dataset about every airline flight in the United States from 1987 until 2015, as outlined in our blog post Delivering Scalable Self Service Analytics.
With this dataset SingleStore can encode and compress the data, allowing for extremely rapid scans of up to 1 billion rows per second per core.
SingleStoreDB Self-Managed 6 also takes advantage of Intel’s Single Instruction, Multiple Data (SIMD) instructions. This technique allows the CPU to complete multiple data operations in a single instruction, essentially vectorizing and parallelizing query processing.
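To ground this, here is a sketch of a columnstore table modeled loosely on the airline dataset; the table and column names are illustrative. Low-cardinality string columns such as the carrier are dictionary encoded, so a group-by like the one below can scan encoded data directly.

CREATE TABLE flights (
  flight_date DATE,
  carrier VARCHAR(8),
  origin VARCHAR(4),
  dest VARCHAR(4),
  dep_delay INT,
  KEY (flight_date) USING CLUSTERED COLUMNSTORE
);

-- Aggregation over a dictionary-encoded column; the scan operates on
-- compressed, encoded values and benefits from SIMD vectorization.
SELECT carrier, COUNT(*) AS num_flights, AVG(dep_delay) AS avg_delay
FROM   flights
GROUP  BY carrier;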
Read Post

Trending
Migrating from Traditional Databases to the Cloud
The Current Organizational Data Dilemma
Today, many companies have years of investment in traditional databases from Oracle, SAP, IBM, Microsoft and others. Frequently these databases and data warehouses have long outlived their ability to cost-effectively perform and scale. These solutions also offer mixed options when it comes to cloud deployments.
Over the past decade, data-driven organizations looked to new methods such as Hadoop and NoSQL to solve data challenges. Hadoop has proven to be a good place to store data, but a challenging place for production data workflows. Gartner noted at a recent conference that only 17 percent of Hadoop deployments were in production in 2017. Also, adding a ‘SQL layer’ on top of the Hadoop Distributed File System is not a path for building a robust, transaction-capable datastore.
Similarly, NoSQL is a great place to store simple key-value pairs, but a challenging place for analytics. A prominent Gartner analyst notes, “The term NoSQL, in my opinion is meaningless and useless… As the term NoSQL refers to a language, the better term for these DBMSs would be non-relational.” Non-relational, of course, can make sophisticated analytics tricky.
Given these lackluster alternatives, organizations need a modern, scalable solution for database and data warehouse workloads that frees them to deploy flexible data architectures of their choosing. Such a solution would also include seamless paths to the cloud across all workloads.
Dilemma Snapshot
– Organizations are awash in on-premises database and data warehouse deployments from traditional vendors
– These solutions are often coupled with technology lock-in, especially in the form of hardware appliances and the restriction to run in specified environments; cloud options are limited
– Many traditional solutions have massive complexity, including hundreds of thousands of lines of custom code due to shortcomings of legacy technology
– Shortcomings resulted in bolt-on solutions to address scale, speed, and concurrency issues
Data Architecture Challenges Ahead
Handling data securely, cost-effectively, and at scale requires effort across multiple areas.
Cost Reduction
Organizations continually seek to reduce license, maintenance, and operational costs of data solutions. Legacy offerings that require specific hardware appliances no longer meet the needs of modern workloads.
Contract and Licensing Complexity
Complex contract and licensing models for traditional solutions make them impractical for organizations aiming to move quickly, adapt to changing conditions, and drive results with new data streams.
Performance at Scale
Hardware-centric appliances or single-server solutions that do not automatically distribute the workload are not capable of delivering performance at scale.
Security Requirements
With data at the core of every organization, security remains critical, and databases and data warehouses must provide a comprehensive and consistent security model across deployments from on-premises to the cloud.
Cloud Choices
As the world goes cloud, organizations need the ability to quickly and safely move data workflows and applications to a cloud of their choice. These migrations should happen seamlessly and without limitations. An option to choose clouds (public, private, hybrid, and multi-cloud) should be on the table at all times.
Customer Use Cases
Here’s how organizations have moved from Oracle while retaining options for on-premises and the cloud.
Moving from Oracle to SingleStore instead of a Data Lake
In this first use case we explore a large financial institution moving to SingleStore after trying unsuccessfully to move data applications from traditional datastores to Hadoop.
Need
The overarching need was driven by a desire to move off of legacy systems, such as Oracle, Netezza, SybaseIQ, and Teradata. These systems were not achieving the required levels of performance, and were becoming painfully expensive for the organization to support.
Exploration
The data team at the bank initially attempted to migrate to a Hadoop-based data lake. Unfortunately, data application migration was taking approximately one year per application. Much of the delay was due to requirements to retrain relational database management system (RDBMS) developers on new languages and approaches such as MapReduce, and using SQL layers that “kind of” provide SQL functionality, but frequently fall short in terms of SQL surface area and robustness.
Solution
Adding SingleStore as a data application destination, with complete coverage for INSERT, UPDATE, and DELETE commands, reduced application migration time down to one month. This new solution using SingleStore now supports one of the largest data warehouses in the bank. Additionally, the workflows that were migrated to SingleStore are completely cloud-capable.
Moving from Oracle to SingleStore Instead of Exadata
A large technology company faced a performance need for its billing system. It had plenty of Oracle experience in-house, but the Oracle solutions proposed were priced prohibitively and could not deliver the necessary performance.
Need
As part of an expansion plan for its billing system, the large technology company sought an effective and affordable solution to expand from 70,000 to 700,000 transactions per second.
Exploration
The team briefly considered a solution with Oracle Exadata, which topped out at 70,000 transactions per second. Part of this is due to the fact that Oracle ingests data using the traditional I/O path, which cannot keep up with the world’s fastest workloads.
SingleStore, on the other hand, using a distributed, memory-optimized architecture, achieved six million UPSERTs per second in initial trials. After server and network optimizations, that same solution is nearing 10 million UPSERTs per second.
Solution
Implementing SingleStore helped save millions by avoiding the high cost of Oracle solutions.
UPSERTs provide computational efficiency compared to traditional INSERTs. For example, with UPSERTs, data is automatically pre-aggregated, allowing for faster queries. Additionally, SingleStore configurations perform on low-cost, industry-standard servers, providing dramatic savings on hardware and an easy path to move to the cloud.
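A minimal sketch of this pre-aggregation pattern, assuming a hypothetical usage_counters table: each incoming event either inserts a new row or folds into the existing aggregate, so billing queries read running totals rather than raw events.

CREATE TABLE usage_counters (
  subscriber_id BIGINT NOT NULL,
  bytes_used BIGINT NOT NULL,
  request_count BIGINT NOT NULL,
  PRIMARY KEY (subscriber_id)
);

-- One statement per event: insert if the subscriber is new,
-- otherwise add to the running totals.
INSERT INTO usage_counters (subscriber_id, bytes_used, request_count)
VALUES (42, 1048576, 1)
ON DUPLICATE KEY UPDATE
  bytes_used = bytes_used + VALUES(bytes_used),
  request_count = request_count + VALUES(request_count);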
Moving from Oracle to SingleStore and Introducing Tableau
A global fixed-income investment firm with nearly $500 billion in assets under management set a strategy to regain control of database performance and reduce its reliance on Oracle.
Need
The firm faced a dilemma with its existing Oracle configurations in that direct, end user queries negatively impacted performance. At times, queries from individual users impacted the database for the company as a whole. Concurrent with this performance challenge was an organizational desire to “not bet the farm on Oracle.”
Exploration
The team looked for a solution that would provide the scale and performance needed from a database perspective but also the ability to implement ‘query’ governance using a business intelligence tool.
Solution
Ultimately the team chose SingleStore due to its RDBMS foundation, and the ability to scale using a distributed, memory-optimized architecture. For business intelligence, the firm selected Tableau, and Tableau includes a native SingleStore connector. The organization now has a cloud-capable architecture that allows it to move to any public cloud of its choice at any time.
Creating a Migration Plan
Migrating from traditional databases and data warehouses takes planning, but more importantly it takes organizational effort and a will to move to new solutions. Here are several tips that can streamline your journey.
Assessment
You can begin by taking an inventory of database and data warehouse infrastructure, including systems from Oracle, SAP, Microsoft and IBM, as well as Netezza, Vertica, Greenplum and others. These products span the database to data warehouse spectrum.
For products with a long history, aim to determine the degree of stickiness due to custom code. For example, Oracle stored procedures are widely used in many organizations and reliance on this custom code factors heavily in the ease of migration.
Put yourself in a position to succeed by identifying older data applications that have struggled with performance or become painfully expensive.
Planning
Next, identify easy applications for migration. For example, consider the following popular use cases for quick wins, high success potential, low risk, rapid cost reduction, and immediate impact.
Exadata Replacements
Exadata replacements are appropriate in cases where customers have been required to buy Exadata for performance, but may not need the entire feature set. Similar reasoning can be applied to SAP HANA and IBM Db2.
Large-scale Databases
Often for real-time data warehousing solutions, customers prefer to consolidate many databases into a single deployment. This can be accomplished with a distributed system that scales well across dozens or hundreds of nodes.
High-speed Ingest
For newer workloads, especially those around internet-scale tracking, IoT sensors, or geospatial data collection, capturing all of the data is critical to the solution. Legacy datastores tend to struggle with these configurations whereas newer distributed architectures handle high-speed ingest easily.
Low-latency Queries
Query performance is a perennial database and data warehouse requirement. However, many legacy solutions are not equipped to handle the real-time needs of modern workloads, in particular the ability to handle sophisticated queries on large datasets, especially when that dataset is constantly changing. Systems that can handle low-latency queries on rapidly updating data serve modern workloads effectively.
High Concurrency
When data applications reach a level of adoption they face a blessing and a curse. On the one hand, continued adoption validates the application success. On the other hand, too many concurrent users can negatively impact performance. Data applications that demand a high level of user concurrency are solid candidates to move to a more powerful, memory-optimized distributed system.
Using SingleStore and Oracle Together
SingleStore fits data ecosystems well, as exemplified by several features. First, SingleStore offers users flexible deployments – whether it’s hybrid cloud, on-premises, VMs, or containers. Second, there are tools such as the SingleStore Spark Connector for high-speed, highly-parallel connectivity to Spark, and SingleStore Pipelines which lets you build real-time pipelines and import from popular datastores, such as HDFS, S3, and MySQL. And SingleStore is a memory-first engine, designed for concurrent data ingest and analytics. These ingredients make SingleStore a perfect real-time addition to any stack.
Many customers combine SingleStore with traditional systems, in particular Oracle databases. SingleStore and Oracle can be deployed side-by-side to enhance scalability, distributed processing, and real-time analytics.
SingleStore for Real-Time Analytics
Data can be copied from Oracle to SingleStore using a data capture tool, and analytical queries can be performed in real time.
Read Post

Trending
Modern Data Warehousing, Meet AI
We are enchanted by the possibility of digital disruption. New computing approaches, from cloud to artificial intelligence and machine learning, promise new business models and untold efficiencies. We are closing the gap between science fiction and business operations.
A Quick Look Back
Let’s take a quick look back at data processing, and then come back to the industry frontier.
It started with data and the place to put it, which became the database. Then came a desire to understand the data through analytics, and that spawned the introduction of data warehouses. Data warehouses are a form of databases more suited to analytics. Ultimately databases and data warehouses are both datastores, and within the data industry these terms sometimes merge.
The Current Data Landscape
Today, the transactional operations of many business applications are in fine shape. In particular, transactions driven by human or business-to-business interactions do not require significant computing resources.
Often core transactional systems, or databases, can benefit from faster analytics, which do not need to disrupt the transactional workflow. For example, adding a real-time data warehouse to the architecture brings instant insights to drive business decisions.
A New Class of Transaction
But there is a new class of modern transactions ready to handle data ranging from IoT sensor information to website traffic logs to global financial reporting. The volume of transactions drives a need beyond what a traditional database or data warehouse can accept. Enter the real-time data warehouse.
A Modern, Real-Time Data Warehouse
Entirely new systems are needed to capture modern transactional, event, and streaming data. A real-time data warehouse fits this need by ingesting and persisting data in real time while serving low-latency analytic queries to large numbers of concurrent users.
In dealing with such large volumes of data, represented by all forms of ‘modern transactions’, a real-time data warehouse also needs the ability to use machine learning to help harness insights from a vast array of live inputs.
Fast Machine Learning Built-In
Incorporating machine learning with your real-time data warehouse leads to a powerful, simplified data infrastructure. Getting there is straightforward.
Step 1 – Identify a modern transactional workload
This could be any volume of data that pushes the limits of your organization’s existing data systems. Even if you pull data from multiple sources, you want to hone your skill set at rapidly ingesting large volumes. Good examples might be data coming from a message queue such as Kafka, or coming in from Hadoop or S3.
Step 2 – Derive immediate insight with SQL
With a real-time data warehouse, you can ingest data, including transactional data, and immediately access that data with SQL. This provides a powerful, efficient, and universal approach to data exploration. It further opens access to a wide range of technical and business analysts at any company who are familiar with SQL.
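For example, a first pass at insight on freshly ingested data can be nothing more than standard SQL against the landing table. The clickstream table and columns below are hypothetical.

-- What happened in the last minute, straight from the ingest table.
SELECT event_type, COUNT(*) AS events_last_minute
FROM   clickstream
WHERE  event_time > NOW() - INTERVAL 1 MINUTE
GROUP  BY event_type
ORDER  BY events_last_minute DESC;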
Step 3 – Delve into ML and AI
Certain real-time data warehouses, such as SingleStore, have machine learning capabilities, including:
– DOT_PRODUCT to compare vectors directly in SQL
– K-means clustering using extensibility
– Bi-directional, high throughput, highly parallel connector to Apache Spark
By incorporating these capabilities within the real-time data warehouse, organizations can dramatically simplify data architectures and provide wide access to real-time information for faster critical decisions.
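As a sketch of the DOT_PRODUCT capability, a vector similarity search can stay entirely in SQL. The table below is hypothetical and assumes feature vectors stored with JSON_ARRAY_PACK.

-- Rank stored feature vectors by similarity to a query vector.
SELECT image_id,
       DOT_PRODUCT(feature_vector,
                   JSON_ARRAY_PACK('[0.12, 0.87, 0.03, 0.44]')) AS similarity
FROM   image_features
ORDER  BY similarity DESC
LIMIT  10;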
Today, we launched our latest O’Reilly book, Data Warehousing in the Age of Artificial Intelligence. To learn more, check out the snapshot below and download our latest ebook.
What’s Inside?
Chapter 1: The Role of a Modern Data Warehouse in the Age of AI
Enterprises are constantly collecting data. Having a dedicated data warehouse offers rich analytics without affecting the performance of the application. A modern data warehouse can support efficient query execution, along with delivering high performance transactional functionality to keep the application and the analysis synchronized.
Chapter 2: Framing Data Processing with ML and AI
The world has become enchanted with the resurgence in AI and ML to solve business problems. All of these processes need places to store and process data. For modern workloads, we have passed the monolithic and moved on to the distributed era, and we can see how ML and AI will affect data processing itself.
Chapter 3: The Data Warehouse Has Changed
Decades ago, organizations used transactional databases to run analytics. Then applications evolved to collect large volumes and velocity of data driven by web and mobile technologies. Recently, a new class of data warehouse has emerged to address the changes in data while simplifying the setup, management, and data accessibility.
Chapter 4: The Path to the Cloud
There is no question that, whether public or private, cloud computing reigns as the new industry standard. When considering the right choices for cloud, data processing infrastructure remains a critical enablement decision.
Chapter 5: Historical Data
All of your business’s data is historical data; it represents events that happened in the past. In the context of your business operations, “real-time data” refers to the data that is sufficiently recent to where its insights can inform time sensitive decisions. Historical data itself might not be changing, but the applications, powered by models built using historical data, will create data that needs to be captured and analyzed.
Chapter 6: Building Real-Time Data Pipelines
Building any real-time application requires infrastructure and technologies that accommodate ultra-fast data capture and processing. A memory-optimized data warehouse provides both persistence for real-time and historical data as well as the ability to query streaming data in a single system.
Chapter 7: Combining Real Time with Machine Learning
Machine learning encompasses a broad class of techniques used for many purposes, and in general no two ML applications will look identical to each other. This is especially true for real-time applications, for which the application is shaped not only by the goal of the data analysis, but also by the time constraints that come with operating in a real-time window.
Chapter 8: Building the Ideal Stack for Machine Learning
An ML “stack” is not one dimensional. Building an effective ML pipeline requires balance between two natural but competing impulses – use existing technology or build something yourself. Although many ML algorithms are not (or not easily) parallelizable, this is only a single step in your pipeline.
Chapter 9: Strategies for Ubiquitous Deployment
As companies move to the cloud, the ability to span on-premises and cloud deployments remains a critical enterprise enabler. In this chapter, we take a look at hybrid cloud strategies.
Chapter 10: Real-Time Machine Learning Use Cases
Real-time data warehouses help companies take advantage of modern technology and are critical to the growth of ML, big data, and AI. Companies looking to stay current need a data warehouse to support them. If your company is looking to benefit from ML and AI and needs data analytics in real time, choosing the correct data warehouse is critical to success.
Chapter 11: The Future of Data Processing for Artificial Intelligence
Maximizing value from ML applications hinges not only on having good models, but on having a system in which the models can continuously be made better. The reason to employ data scientists is because there is no such thing as a self-contained and complete ML solution. In the same way that the work at a growing business is never done, intelligent companies are always improving their analytics infrastructure.
Read Post

Product
The Real-Time Data Warehouse for Hybrid Cloud
As companies move to the cloud, the ability to span on-premises and cloud deployments remains a critical enterprise enabler.
In this post, we’ll review the basics of a hybrid cloud model for SingleStore deployments to fit conventional and future needs.
SingleStore Background
SingleStore is a real-time data warehouse optimized for hybrid cloud deployments, and excels at operational use cases. SingleStore users are drawn to the ability to load and persist live data at petabyte scale while simultaneously querying that data in real-time.
SingleStore supports deployment on any cloud including offering a managed service called SingleStoreDB Cloud. Today, nearly half of SingleStore customers use cloud offerings in production or test and development deployments. On-premises SingleStoreDB Self-Managed can be installed and managed by the enterprise.
An Introduction to the SingleStore Hybrid Cloud Model
The SingleStore model begins with the fundamental design principle of a flexible software footprint. With this principle, SingleStore can be run on-premises and in the cloud.
SingleStore in the Cloud
In the cloud, SingleStore offers a managed service where customers can use the database, but do not have to administer the service. This provides any company with the ability to benefit from SingleStore scale and performance, with SingleStore handling cluster operations. When run in the cloud, SingleStore takes advantage of the cloud platform such as low cost scalable storage to store backups and data, and readily available compute to scale query performance and concurrency. SingleStore can also run on any cloud infrastructure-as-a-service.
SingleStore Self-Managed
On-premises SingleStoreDB Self-Managed runs on any modern Linux operating system. It can run on one or hundreds of nodes. There are no dependencies on any hardware, and no dependencies on any cloud provider.
Building on this architectural premise, SingleStore has evolved to have the most advanced database deployment models available.
Building on the flexible software footprint, SingleStore can be deployed:
– on any server, with a minimum of 4 cores and 8GB of RAM
– on any modern version of Linux
– within a virtual machine
– within a container for test and development
– on any public cloud provider’s infrastructure-as-a-service platform
– as a cloud service, managed by SingleStore
This unparalleled flexibility allows enterprises to make big bets on their analytics infrastructure while simultaneously controlling deployment and cost levers throughout the lifetime of an application.
Read Post

Data Intensity
The Analytics Race Amongst The World’s Largest Companies
Read Post

Data Intensity
Real-Time and The Rise of Nano-Marketing
The tracking and targeting of our online lives is no secret. Once we browse to a pair of shoes on a website, we are reminded about them in a retargeting campaign. Lesser known efforts happen behind the scenes to accumulate data and scan through it in real time, delivering the perfect personalized campaign. Specificity and speed are converging to deliver nano-marketing.
If you are a business leader, you’ll want to stay versed in these latest approaches. And as a consumer, you’ll likely want to understand how brands are applying their craft to you personally.
Brands seek specific customer interactions. If you sign up for a retailer’s newsletter, you might receive a preferences questionnaire so they can tailor everything to your specific wants or needs.
But speed also matters, as many of the largest marketing-driven industries like fashion, TV, movies, and music, depend on relevancy in the moment. Being current is currency itself. Only through real-time interaction can this be achieved.
Looking ahead, leaders of digital initiatives will expand their focus from today’s notion of personalized marketing to “nano-marketing,” using tools to predict granular audience cohorts on the fly and prescribe individualized marketing experiences in real time. Brands can improve the customer experience directly through context, individual interaction, and immediacy.
For example, when you walk into a furniture showroom where you also have an online account, the sales representative should know what you were searching for before you arrived, even if it was just a few hours ago. And they should have easy access to your Pinterest page if you’ve made that public. These are the types of experiences we can expect in the future with nano-marketing.
Behind nano-marketing, taking personalized marketing to the next level
The concept behind personalized marketing is hardly new. Brands have always strived to create special experiences for customers in order to entice them to return. With the creation of Customer Relationship Management (CRM) systems and the proliferation of social media, this idea has become even more popular.
Marketing to customer segments of one merges existing and new disciplines of the trade. The low bar for what currently qualifies as “personalized marketing” will soon rise with the advent of tools that allow finer granularity, faster.
Looking ahead, we can expect three areas of marketing innovation:
The Autonomous Marketing Stack
Marketers have a plethora of available tools across infrastructure and analytics, including platforms like Salesforce.com, Marketo, Eloqua, Omniture, Google Analytics, and dozens of more specialized offerings. Truthfully the availability of special purpose tools has outstripped the individual’s ability to integrate them.
In the coming years, we’ll move far beyond just cobbling together the tools that help us be more efficient and cater to our customers; we’ll have a marketing tool stack that implements and executes campaigns on its own.
Imagine a system that watches social feeds for popular items, aggregates existing content to resurface it into the discussion, and kicks off a set of new content assets to carry the conversation forward. And this happens between Friday and Sunday with little human effort.
Virtual Reality Is The New Content
Today marketers often focus on generating a considerable amount of written content. Tomorrow they will put the pen down and focus on virtual experiences for customers that allow them to interact with content in ways not possible before. With attention spans getting shorter, and the firehose of new content bombarding customers, brands will need to focus on things that don’t just inform, but also entertain.
Whereas today an automobile company might customize regional billboards to fit with the landscape, soon they will offer tailored virtual reality experiences in a city and driving venue of your choice.
Where The Real-Time Meets The Road
Finally all of this will come together in the insatiable pursuit of instant gratification. Not only will consumers not be surprised by real-time results, they will come to demand it. To stay on top, marketers, and the tools they use, will need to absorb, process, and contextualize information more quickly to deliver unique interactive experiences. This is already happening in areas like ad tech and finance, but stay tuned as the latest in real-time technologies work their way across all industries.
Read Post

Data Intensity
From Big to Now: The Changing Face of Data
Data is changing. You knew that. But the dialog over the past 10 years around big data and Hadoop is rapidly moving to data and real-time.
We have tackled how to capture big data at scale. We can thank the Hadoop Distributed File System for that, as well as cloud object stores like AWS S3.
But we have not yet tackled the instant results part of big data. For that we need more. But first, some history.
Turning Point for the Traditional Data Warehouse
Internet scale workloads that emerged in the past ten years threw the traditional data warehouse model for a loop. Specifically, the last generation of data warehouses relied on:
– Scale-up models; and
– Appliance approaches
Vast amounts of internet and mobile data have made these prior approaches uneconomical, and today customers suffer the high cost of conventional data warehouses.
Read Post

Engineering
Design Principles at SingleStore
At SingleStore, we believe that every company should have the opportunity to become a real-time enterprise.
We believe that data drives business, and that data management performance leads to successful business results.
Specifically related to our products, we believe in:
The Need for Performance
To compete in a 24×7 business world, companies must be able to ingest data quickly, execute low latency queries, and support a large number of analytics users and data volume, all at the same time.
The Scale of Distributed Systems
Today, systems need to scale beyond a single server, and distributed systems remove single server performance and capacity constraints. The strength of many can act as one. Our blog post on Jumping the Database S-Curve details more.
The SQL Standard
The Structured Query Language has served the data industry well for decades. We see core SQL capabilities as a prerequisite to analytics success. SQL as an afterthought, or as a layer, is not sufficient for real-time applications. Recently, SQL has been making the news again as The Technology That Never Left Is Back. Of course, there is more to the world than just SQL, and SingleStore also supports JSON, key-value models, geospatial data and connectivity to Spark for more advanced functions.
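For instance, geospatial predicates can sit directly alongside standard SQL. A sketch, assuming a hypothetical vehicles table with a geospatial location column; the coordinates point to downtown San Francisco.

-- Find vehicles within 5 km of a given point, nearest first.
SELECT vehicle_id,
       GEOGRAPHY_DISTANCE(location, 'POINT(-122.4194 37.7749)') AS meters_away
FROM   vehicles
WHERE  GEOGRAPHY_DISTANCE(location, 'POINT(-122.4194 37.7749)') < 5000
ORDER  BY meters_away;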
The Flexibility to Store Data from Memory to Disk
Memory, specifically DRAM, provides performance for modern workloads. But memory must also be coupled with disk so companies can retain real-time and historical data in a single system.
The Move to the Cloud and Need for Cloud Choice
Computing is moving to the cloud, and companies need the ability to deploy solutions on any server configuration, on any public or private cloud, or as a service.
The Convergence of Transactions and Analytics
With the right architecture, transactions and analytics can occur within the same system, eliminating the ETL process, consolidating database and data warehouses, and allowing immediate insight into critical applications.
Read Post

Engineering
ArcGIS, Spark & SingleStore Integration
This is a guest post by Mansour Raad of Esri. We were fortunate to catch up with him at Strata+Hadoop World San Jose.
This post is replicated from Mansour’s Thunderhead Explorer blog
ArcGIS, Spark & SingleStore Integration
Just got back from the fantastic Strata + Hadoop 2017 conference, where the topics ranged from BigData and Spark to lots of AI/ML, and not so much on Hadoop explicitly, at least not in the sessions that I attended. I think that is why the conference is being renamed Strata + Data from now on, as there is more to BigData than Hadoop.
While strolling the exhibition hall, I walked into the booth of our friends at SingleStore and got a BIG hug from Gary. We reminisced about our co-presentations at various conferences regarding the integration of ArcGIS and SingleStore as they natively support geospatial types.
This post is a refresher on the integration with a “modern” twist, where we are using the Spark Connector to ETL geospatial data into SingleStore in a Docker container. To view the bulk loaded data, ArcGIS Pro is extended with an ArcPy toolbox to query SingleStore, aggregate, and view the result set of features on a map.
Read Post

Trending
Jumping the Database S-Curve
Adaptation and Reinvention
Long term success hinges on adaptation and reinvention, especially in our dynamic world where nothing lasts forever. Especially with business, we routinely see the rise and fall of products and companies.
The long game mandates change, and the database ecosystem is no different. Today, megatrends of social, mobile, cloud, big data, analytics, IoT, and machine learning place us at a generational intersection.
Data drives our digital world and the systems that shepherd it underpin much of our technology infrastructure. But data systems are morphing rapidly, and companies reliant on data infrastructure must keep pace with change. In other words, winning companies will need to jump the database S-Curve.
The S-Curve Concept
In 2011, Paul Nunes and Tim Breene of the Accenture Institute for High Performance published Jumping the S-curve: How to Beat the Growth Cycle, Get on Top, and Stay There.
In the world of innovation, an S-curve explains the common evolution of a successful new technology or product. At first, early adopters provide the momentum behind uptake. A steep ascent follows, as the masses swiftly catch up. Finally, the curve levels off sharply, as the adoption approaches saturation.
The book details a common dilemma: too many businesses manage to only a single S-curve of revenue growth,
in which a business starts out slowly, grows rapidly until it approaches market saturation, and then levels off.
They share the contrast of stand-out, high-performance businesses that manage growth across multiple S-curves. These companies continually find ways to invent new products and services that drive long term revenue and efficiency.
Read Post

Trending
SQL: The Technology That Never Left Is Back!
The Prelude
The history of SQL, or Structured Query Language, dates back to 1970, when E.F. Codd, then of IBM Research, published a seminal paper titled, “A Relational Model of Data for Large Shared Data Banks.”
Since then, SQL has remained the lingua franca of data processing, helping build the relational database market into a $36 billion behemoth.
The Rise And Fall of NoSQL
Starting in 2010, many companies developing datastores tossed SQL out with the bathwater after seeing the challenges in scaling traditional relational databases. A new category of datastores emerged, claiming a new level of scalability and performance. But without SQL they found themselves at a loss for enabling easy analytics. Before long, it was clear that there were many hidden costs of NoSQL.
The Comeback That Never Left
More recently, many point to a SQL comeback, although the irony is that it never left.
In a piece last week on 9 enterprise tech trends for 2017 and beyond, InfoWorld Editor in Chief Eric Knorr notes on trend number 3:
The incredible SQL comeback
For a few years it seemed like all we did was talk about NoSQL databases like MongoDB or Cassandra. The flexible data modeling and scale-out advantages of these sizzling new solutions were stunning. But guess what? SQL has learned to scale out, too – that is, with products such as ClustrixDB, DeepSQL, SingleStore, and VoltDB, you can simply add commodity nodes rather than bulking up a database server. Plus, such cloud database-as-a-service offerings as Amazon Aurora and Google Cloud SQL make the scale-out problem moot.
At the same time, NoSQL databases are bending over backward to offer SQL interoperability. The fact is, if you have a lot of data then you want to be able to analyze it, and the popular analytics tools (not to mention their users) still demand SQL. NoSQL in its crazy varieties still offers tremendous potential, but SQL shows no sign of fading. Everyone predicts some grand unification of SQL and NoSQL. No one knows what practical form that will take.
Taking the ‘no’ out of NoSQL
In an article, Who took the ‘no’ out of NoSQL?, Matt Asay writes,
In the wake of the NoSQL boom, we’re seeing a great database convergence between old and new.
Everybody wants to speak SQL because that’s where the primary body of skills reside, given decades of enterprise build-up around SQL queries.
The article, interviewing a host of NoSQL specialists, reminds us of the false conventional wisdom that SQL doesn’t scale. Quoting a former MongoDB executive, Asay notes,
But the biggest benefit of NoSQL, and the one that RDBMSes have failed to master, is its distributed architecture.
The reality is that legacy vendors have had trouble applying scale to their relational databases. However, new companies using modern techniques have shown it is very possible to build scalable SQL systems with distributed architectures.
SQL Reigns Supreme in Amazon Web Services
There is no better bellwether for technology directions these days than Amazon Web Services. And the statistics shared by AWS tell the story.
In 2015, Andy Jassy, CEO of Amazon Web Services, noted that the fastest growing service in AWS was the data warehouse offering Redshift, based on SQL.
In 2016, he noted that the fastest growing service in AWS was the database offering Aurora, based on SQL.
And one of the newest services, AWS Athena, delivers SQL on S3. This offering is conceptually similar to the wave of ‘SQL as a layer’ solutions developed by Hadoop purveyors so customers could have easy access to unstructured data in HDFS. Lo and behold, there were simply not enough MapReduce experts to make sense of the data.
AWS has recognized a similar analytics conundrum with S3, whose growth has been so strong that object stores appear to be becoming the new data lakes. And what do you do when you have lots of data to examine and want to do so easily? You add SQL.
SQL Not Fading Away
Nick Heudecker, Research Director in the Data and Analytics group at Gartner, put his finger on it recently,
Each week brings more SQL into the NoSQL market subsegment. The NoSQL term is less and less useful as a categorization.
— Nick Heudecker (@nheudecker) November 8, 2016
Without a doubt the data management industry will continue to debate the wide array of approaches possible with today’s tools. But if we’ve learned one thing over the last five years, it is that SQL never left, and it remains as entrenched and important as ever.
Read Post

Engineering
Everything We’ve Known About Data Movement Has Been Wrong
Data movement remains a perennial obstacle in systems design. Many talented architects and engineers spend significant amounts of time working on data movement, often in the form of batch Extract, Transform, and Load (ETL). In general, batch ETL is the process everyone loves to hate, or put another way, I’ve never met an engineer happy with their batch ETL setup.
In this post, we’ll look at the shift from batch to real time, the new topologies required to keep up with data flows, and the messaging semantics required to be successful in the enterprise.
The Trouble with Data Movement
There is an adage in computing that the best operations are the ones you do not have to do. Such is the thinking with data movement. Less is more.
Today a large portion of time spent on data movement still revolves around batch processes, with data transferred on a periodic basis between one system and the next. However, Gartner states,
Familiar data integration patterns centered on physical data movement (bulk/batch data movement, for example) are no longer a sufficient solution for enabling a digital business.
And this comical Twitter message reflects the growing disdain for a batch-oriented approach…
I hate batch processing so much that I won’t even use the dishwasher.
I just wash, dry, and put away real time.
— Ed Weissman (@edw519) November 6, 2015
There are a couple of ways to make batch processing go away. One involves moving to robust database systems that can process both transactions and analytics simultaneously, essentially eliminating the ETL process for applications.
Another way to reduce the time spent on batch processing is to shift to real-time workflows. While this does not change the amount of data going through the system, moving from batch to real time helps normalize compute cycles, mitigate traffic surges, and provide timely, fresh data to drive business value. If executed well, this initiative can also reduce the time data engineers spend moving data.
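For instance, in a database that supports native streaming ingest, such as SingleStore pipelines, a real-time workflow can replace a periodic batch job with a continuously running pipeline. The sketch below is illustrative only; the table, topic, and host names are hypothetical.

```sql
-- Hypothetical example: continuously ingest comma-separated click events
-- from a Kafka topic into a table, instead of loading them in nightly batches.
CREATE TABLE clicks (
    event_time DATETIME,
    user_id BIGINT,
    url VARCHAR(2048),
    KEY (event_time)
);

CREATE PIPELINE clicks_pipeline AS
LOAD DATA KAFKA 'kafka-host.example.com:9092/clicks'
INTO TABLE clicks
FIELDS TERMINATED BY ',';

START PIPELINE clicks_pipeline;
```

Once started, the pipeline keeps the table current as events arrive, so downstream queries always see fresh data without a separate ETL step.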
The Enterprise Streaming Opportunity
Most of the discussion regarding streaming today is happening in areas like the Internet of Things, sensors, web logs and mobile applications. But that is really just the tip of the iceberg.
Read Post

Engineering
New Performance Benchmark for Live Dashboards and Fast Updates
Newest Upsert Benchmark showcases critical use case for internet billing with telcos, ISPs, and CDNs
SingleStore achieves 7.9 million upserts per second, 6x faster than Cassandra
Benchmark details and scripts now available on GitHub
The business need for fast updates and live dashboards
Businesses want insights from their data, and they want them sooner rather than later. For fast-changing data, companies must rapidly glean insights in order to make the right decisions. Industry applications like IoT telemetry monitoring, mobile network usage, internet service provider (ISP) billing, and content delivery network (CDN) usage tracking depend upon real-time analytics with fast-changing data. Web traffic merits special attention since it continues to grow at an astounding rate. According to Cisco, global IP traffic will increase nearly threefold over the next 5 years, and will have increased nearly a hundredfold from 2005 to 2020. Overall, IP traffic will grow at a compound annual growth rate (CAGR) of 22 percent from 2015 to 2020. Many businesses face the challenge of monitoring, analyzing, and monetizing large-scale web traffic, so we will explore this use case.
Use case example
In particular, we dive into the example of a content delivery or distribution network (CDN). A CDN is a globally distributed network of web servers deployed in multiple data centers across different geographic regions and is relied upon by content providers such as media companies and e-commerce vendors to deliver content to end users. CDNs have a business need to monitor their systems in real time. In addition to logging customer usage for the purpose of billing, they want to be alerted to sudden increases and decreases in their workloads for load balancing as well as for detecting network events like “denial of service attacks”. The sheer volume of web traffic mandates a massively parallel processing (MPP) system that can scale out to support the load. The concurrent need for real-time analytics points in the direction of hybrid transaction/analytical processing, or HTAP. HTAP systems enable high-speed ingest and sophisticated analytics simultaneously without data movement or ETL.
Background on the Upsert Benchmark
This benchmark demonstrates the raw horsepower of a database system capturing high volume updates. Update, or upsert, is the operative word here. With a conventional `insert` a new row is created for each new database entry. With an upsert, individual rows can be updated in place. This upsert capability allows for a more efficient database table and faster aggregations, and it is particularly useful in areas such as internet billing. For more detail on this workload in use, take a look at this blog post, Turn Up the Volume With High-Speed Counters.
SingleStore delivers efficient upsert performance, achieving up to 8 million upserts per second on a 10-node cluster, using the following parameterized query:
Upsert query for SingleStore
`insert into records (customer_code, subcustomer_id, geographic_region, billing_flag, bytes, hits) values (?, ?, ?, ?, ?, ?) on duplicate key update bytes=bytes+VALUES(bytes), hits=hits+VALUES(hits);`
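The upsert behavior depends on the target table having a primary or unique key over the grouping columns, so that repeated keys update an existing row rather than create a new one. A minimal sketch of such a schema is shown below; the column types and key choice are assumptions for illustration, not the benchmark’s published DDL.

```sql
-- Hypothetical schema sketch: the PRIMARY KEY over the grouping columns is
-- what makes INSERT ... ON DUPLICATE KEY UPDATE accumulate bytes and hits
-- in place instead of adding a new row per event.
CREATE TABLE records (
    customer_code INT NOT NULL,
    subcustomer_id VARCHAR(16) NOT NULL,
    geographic_region INT NOT NULL,
    billing_flag INT NOT NULL,
    bytes BIGINT NOT NULL,
    hits BIGINT NOT NULL,
    PRIMARY KEY (customer_code, subcustomer_id, geographic_region, billing_flag)
);
```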
Comparing Upsert performance
Legacy databases and data warehousing solutions are optimized for batch loading of data and consequently are unable to handle fast data insertions along with ad-hoc analysis of freshly generated data. NoSQL databases like Cassandra can handle fast data insertions but have more challenges with upserts, which are critical for web traffic monitoring across end-customer behavior and tracking web requests. More importantly, however, Cassandra does not provide native support for analytics and requires users to bring in additional components like SparkSQL in order to support meaningful querying of data.
We created the following query for Cassandra:
Upsert query for Cassandra
`update perfdb.records set hits = hits + 1 where timestamp_of_data=1470169743185 and customer_code=25208 and subcustomer_id='ESKWUEYXUKRB' and geographic_region=10 and billing_flag=1 and ip_address='116.215.6.236';`
The upsert benchmark is based on a simulated workload that logs web traffic across ten different geographic regions. SingleStoreDB Self-Managed 5.1 runs on a 10-node m4.10xlarge cluster on AWS, at \$2.394 per hour (effective pricing with 1-year reserved instances), and is able to execute up to 8 million upserts per second and simultaneously run live queries on the latest data to provide a real-time window on the changing shape of traffic.
Cassandra running on an identical cluster achieves 1.5 million upserts per second. We tested the most recent 3.0.8 version of Apache Cassandra. In the Cassandra query, update means upsert.
As noted in the following chart, SingleStore scales linearly as we increase the number of machines with a batch size of 500. Cassandra, however, does not appear to support large batch sizes well. According to the Cassandra configuration notes,
# Caution should be taken on increasing the size of this threshold as it can lead to node instability.
# Fail any batch exceeding this value. 50kb (10x warn threshold) by default.
So we set `batch_size_fail_threshold_in_kb: 5000` to support a 10,000 row batch size, but we encountered numerous errors that prevented the benchmark from running on Cassandra with these settings.
Read Post

Engineering
Third Normal Form, Star Schema, and a Performance Centric Data Strategy
Keeping it Straight
Data value comes from sharing, so staying organized and providing common data access methods across different groups can bring big payoffs.
Companies struggle daily to keep data formats consistent across applications, departments, people, divisions, and new software systems installed every year.
Passing data between systems and applications is called ETL, which stands for Extract, Transform, and Load. It is the process everyone loves to hate. There is no glamour in reconfiguring data, such as date formats, from one system to another, but there is glory in minimizing the amount of ETL needed to build new applications.
To minimize ETL friction, data architects often design schemas in third normal form, a database term that indicates data is well organized and unlikely to be corrupted due to user misunderstanding or system error.
Getting to Third Normal Form
The goal of getting to third normal form is to eliminate update, insertion, and deletion anomalies.
Take this employee, city, and department table as an example:
| employee_id | employee_name | employee_city | employee_dept |
| --- | --- | --- | --- |
| 101 | Sam | New York | 22 |
| 101 | Sam | New York | 34 |
| 102 | Lori | Los Angeles | 42 |
Update Anomalies
If Sam moves to Boston, but stays in two departments, we need to update both records. That process could fail, leading to inconsistent data.
Insertion Anomalies
If we have a new employee not yet assigned to a department and the ‘employee_dept’ field does not accept blank entries, we would be unable to enter them in the system.
Deletion Anomalies
If the company closed department 42, deleting rows with department 42 might inadvertently delete employee information, like Lori’s.
First Normal Form to Start
First normal form specifies that table values should not be divisible into smaller parts and that each cell in a table should contain a single value.
So if we had a customer table with a requirement to store multiple phone numbers, the simplest method might look like this:
| customer_id | customer_name | customer_phone |
| --- | --- | --- |
| 101 | Brett | 555-459-8912 555-273-2304 |
| 102 | Amanda | 222-874-3567 |
However, this does not meet first normal form requirements, since multiple values sit in a single cell, so to conform we could adjust it to:
| customer_id | customer_name | customer_phone |
| --- | --- | --- |
| 101 | Brett | 555-459-8912 |
| 101 | Brett | 555-273-2304 |
| 102 | Amanda | 222-874-3567 |
Second Normal Form
2nd normal form requires that
- Data be in 1st normal form
- Each non-key column be dependent on the table’s complete primary key
Consider the following example with the table STOCK and columns supplier_id, city, part_number, and quantity where the city is the supplier’s location.
| supplier_id | city | part_number | quantity |
| --- | --- | --- | --- |
| A22 | New York | 7647 | 5 |
| B34 | Boston | 9263 | 10 |
The primary key is (supplier_id, part_number), which uniquely identifies a part in a single supplier’s stock. However, city depends only on supplier_id and not on the full primary key, so the table is not in 2nd normal form.
This causes the following anomalies:
Update Anomalies
If a supplier moves locations, every single stock entry must be updated with the new city.
Insertion Anomalies
The city has to be known at insert time in order to stock a part at a supplier, even though what really matters here is the supplier_id, not the city. Also, unless the city is stored elsewhere, a supplier cannot have a city without having parts, which does not reflect the real world.
Deletion Anomalies
If the supplier is totally out of stock and its rows are deleted, the information about the city in which the supplier resides is lost; if that information is also stored in another table, then city does not need to be in this table anyway.
Separating this into two tables achieves 2nd normal form.
| supplier_id | part_number | quantity |
| --- | --- | --- |
| A22 | 7647 | 5 |
| B34 | 9263 | 10 |

| supplier_id | city |
| --- | --- |
| A22 | New York |
| B34 | Boston |
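Expressed as SQL data definitions, the split above might look like the sketch below (table names and column types are assumptions added for illustration):

```sql
-- Stock quantities depend on the full key (supplier_id, part_number).
CREATE TABLE stock (
    supplier_id VARCHAR(8) NOT NULL,
    part_number INT NOT NULL,
    quantity INT NOT NULL,
    PRIMARY KEY (supplier_id, part_number)
);

-- City depends only on supplier_id, so it moves to its own table.
CREATE TABLE suppliers (
    supplier_id VARCHAR(8) NOT NULL,
    city VARCHAR(64) NOT NULL,
    PRIMARY KEY (supplier_id)
);
```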
Third Normal Form
We’re almost there! With 1st normal form, we ensured that every column attribute only holds one value.
With 2nd normal form we ensured that every column is dependent on the primary key, or more specifically that the table serves a single purpose.
With 3rd normal form, we want to ensure that non-key attributes are dependent on nothing but the primary key. The more technical explanation involves “transitive dependencies” but for the purpose of this simplified explanation we’ll save that for another day.
In the case of the following table, zip is an attribute generally associated with only one city and state. So it is possible, with the data model below, that zip could be updated without properly updating the city or state.
| employee_id | employee_name | city | state | zip |
| --- | --- | --- | --- | --- |
| 101 | Brett | Los Angeles | CA | 90028 |
| 102 | Amanda | San Diego | CA | 92101 |
| 103 | Sam | Santa Barbara | CA | 93101 |
| 104 | Alice | Los Angeles | CA | 90012 |
| 105 | Lucy | Las Vegas | NV | 89109 |
Splitting this into two tables, so that city and state depend only on zip in a separate lookup table rather than being repeated in every employee row, satisfies the requirements for 3rd normal form.
| employee_id | employee_name | zip |
| --- | --- | --- |
| 101 | Brett | 90028 |
| 102 | Amanda | 92101 |
| 103 | Sam | 93101 |
| 104 | Alice | 90012 |
| 105 | Lucy | 89109 |
| zip | city | state |
| --- | --- | --- |
| 90028 | Los Angeles | CA |
| 92101 | San Diego | CA |
| 93101 | Santa Barbara | CA |
| 90012 | Los Angeles | CA |
| 89109 | Las Vegas | NV |
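As a SQL sketch (again with assumed names and types), the 3rd normal form split might be defined as:

```sql
-- Employee rows no longer carry city and state directly; they reference zip.
CREATE TABLE employees (
    employee_id INT NOT NULL,
    employee_name VARCHAR(64) NOT NULL,
    zip CHAR(5) NOT NULL,
    PRIMARY KEY (employee_id)
);

-- City and state depend only on zip, so they live in a lookup table.
CREATE TABLE zip_codes (
    zip CHAR(5) NOT NULL,
    city VARCHAR(64) NOT NULL,
    state CHAR(2) NOT NULL,
    PRIMARY KEY (zip)
);
```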
Benefits of Normalization
Normalizing data helps minimize redundancy and maintain the highest levels of integrity. By organizing column attributes and the relations between tables, data administrators can design systems for efficiency and safety.
More specifically, normalization helps ensure
- Data is not unnecessarily repeated within a database
- Inserts, modifications, and deletions only have to happen once in a database
Data Management with Star Schema
Star schema is an approach of arranging a database into fact tables and dimension tables. Typically a fact table records a series of business events such as purchase transactions. Dimension tables generally store fewer records than fact tables but may have more specific details about a particular record. A product attributes table is one example.
Star schemas are often implemented in a denormalized fashion, with typical normalization rules relaxed. The advantage of this can be simpler reporting logic and faster performance as data may be stored multiple ways to facilitate queries.
The disadvantage of this approach is that integrity is not necessarily enforced through the model, leaving room for an update in one place that may not successfully propagate elsewhere.
Further, with normalization, a wide variety of data analytics tools and approaches can be used to query data without detailed advance knowledge of any one application. Without normalization, schemas tend to become isolated to specific functions and less flexible across a large organization.
Flexible Star Schema Deployments with SingleStore
Is it possible or desirable to merge normalization and star schemas? Sure.
While data management strategies can be very application specific, retaining data in the most universally accessible forms benefits larger organizations. With normalization, data organization transcends application use cases and database systems.
Star schemas often skip normalization for two reasons: simplicity of queries and performance.
Regarding query simplicity, this is a tradeoff between application-specific approaches and data ubiquity across an organization. Independent of the database, this tradeoff remains.
When it comes to performance, historical systems have had challenges with operations like fast aggregations and with the large number of joins driven by third normal form. Modern database architectures have eliminated those performance challenges.
With a solution like SingleStore, a memory-optimized, relational, distributed database, it is possible to achieve normalization and performance. Even with the increased number of tables, and subsequent joins, often resulting from third normal form, SingleStore maintains stellar performance. And the core relational SQL model makes it easy to create or import a range of tables as well as maintain relations between tables.
In the next sections, we’ll explore table types in SingleStore and the associated benefits.
Using Multiple Table Types in SingleStore
SingleStore includes two table types:
- A rowstore table where all the data is retained in memory and all data is persisted to disk
- A columnstore table where some data resides in memory and all data is persisted to disk
Using these two table types, it is possible to design a wide range of schema configurations.
Contrary to popular belief, determining whether you use an all-memory or memory-plus-disk table has less to do with data size, and more to do with how you plan to interact with the data.
- Columnstores are useful when rows are added or removed in batches, and when queries touch all or many records but only for a few columns. Aggregations like Sum, Average, and Count are good examples.
- Rowstores work well when operating over whole rows at a time. This includes updates to individual attributes or point lookups.
For more detail on rowstores and columnstores, check out Should You Use a Rowstore or a Columnstore? from SingleStore VP of Engineering Ankur Goyal.
Creating a Star Schema in SingleStore
Whether or not you lean towards normalization, SingleStore makes it easy to create a star schema within a single database across multiple table types.
Figure: Basics of a star schema with Fact and Dimension tables
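As an illustrative sketch of how the figure above could map onto the two table types, a high-volume fact table might be declared as a columnstore and a smaller dimension table as a rowstore. The table names, columns, and the clustered columnstore key syntax here are assumptions for illustration, not a prescribed design.

```sql
-- Fact table: high-volume purchase events, stored as a columnstore
-- for fast scans and aggregations across many rows.
CREATE TABLE purchases_fact (
    purchase_time DATETIME NOT NULL,
    customer_id BIGINT NOT NULL,
    product_id BIGINT NOT NULL,
    amount DECIMAL(10,2) NOT NULL,
    KEY (purchase_time) USING CLUSTERED COLUMNSTORE
);

-- Dimension table: smaller, frequently read product attributes, stored as
-- an in-memory rowstore for fast point lookups and joins.
CREATE TABLE product_dim (
    product_id BIGINT NOT NULL,
    product_name VARCHAR(128) NOT NULL,
    category VARCHAR(64),
    PRIMARY KEY (product_id)
);
```

Queries then join the fact table to the dimension tables in the familiar star pattern, with the columnstore handling the scans and the rowstore handling the lookups.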
Read Post

Trending
Close Encounters with a Third Kind of Database
For years, we have lived in a data processing world with two primary kinds of database systems, one for capturing incoming data and managing transactions, and another for answering questions and analyzing the data for insight and intelligence.
The first system is what we know as a database, a system designed so you can put data in and get data out reliably and quickly. The second system is generally referred to as a data warehouse.
There has long been a divide between these two systems, but HTAP, or Hybrid Transactional/Analytical Processing, is a new approach that brings these two worlds together.
Check out the complete article on CIO Review
Read Post

Trending
Visit SingleStore in May: GEOINT, Informatica World, In-Memory Computing Summit
SingleStore has a full show schedule in May and we hope you can visit us at one of these events.
GEOINT: May 15-18, Orlando
http://geoint2016.com/
May 15-18, 2016, Gaylord Palms Resort and Convention Center, Orlando, Florida
SingleStore is at booth 1514
GEOINT Symposium is the nation’s largest gathering of industry, academia, and government to include Defense, Intelligence and Homeland Security Communities.
GEOINT is hosted by USGIF, the United States Geospatial Intelligence Foundation
SingleStore will be featured in a lightning talk
Real-Time Geospatial Intelligence at Scale
Mike Kilrain, President, SingleStore USG
Sunday, May 15th, 12:13pm at GEOINT
Read Post

Data Intensity
Always-on Geospatial Analytics
In the past ten years technology shifts have re-crafted the geospatial applications and analytics landscape.
- The iPhone and Android ecosystems have fostered a world where almost everyone is a beacon of information
- Large scale computing capabilities have provided companies like Google and Facebook the ability to keep track of billions of things, and companies like Amazon and Microsoft are making similar computing power available to everyone
- Global internet coverage continues to expand, including innovative programs with balloons and solar powered drones
These trends shape billion-dollar shifts in the mapping and geospatially-oriented industries. For example:
In August 2015, a consortium of the largest German automakers including Audi, BMW, and Daimler (Mercedes) bought Nokia’s Here mapping unit, the largest competitor to Google Maps, for \$3.1 billion.
In addition to automakers like the German consortium having a stake in owning and controlling mapping data and driver user experiences, the largest private companies, like Uber and Airbnb, depend on maps as an integral part of their applications.
Source: VentureBeat
New applications, particularly those around geospatial analytics, are harnessing the geospatial capabilities of an in-memory approach. In particular, transportation is just one of many industries undergoing dramatic shifts as new technology forces align.
To understand the full shifts underway, and to read about transportation analytics examples with New York taxis, Esri’s approach to in-memory databases, and real-time dashboards with Zoomdata, check out our full white paper:
PDF Download: Always-on Geospatial White Paper ⇒
Read Post

Data Intensity
Choosing the Right Infrastructure for IoT
The infrastructure of IoT will have a real-time database behind every sensor.
Soon every device with a sensor will blend seamlessly into the Internet of Things, from drones to vehicles to wearables. Device and sensor count predictions range from billions to trillions. With this tidal wave of new devices comes an increasing number of new data streams, converging to make instant analytics on real-time data a tenet of any digital transformation.
Our penchant for instant gratification extends to every time we press a button or ask a question online. Today, data must move at the speed of thought and real-time information brings us as close as possible to the present.
The infrastructure to make this interaction possible ranges from the edge of the network into the core of the data center, and must include a database to support new interactive applications and analytics. Let’s examine a few compelling IoT use cases where turning data into actionable insights is table stakes.
Drones – Managing the Machines
Read Post

Trending
5 Big Data Themes – Live from the Show Floor
We spent last week at the Big Data Innovation Summit in Boston. Big data trade shows, particularly those mixed with sophisticated practitioners and people seeking new solutions, are always a perfect opportunity to take a market pulse.
Here are the five big data themes we encountered over the course of two days.
Real-Time Over Resuscitated Data
The action is in real time, and trade show discussions often gravitate to deriving immediate value from real-time data. All of the megatrends apply… social, mobile, IoT, cloud, pushing startups and global companies to operate instantly in a digital, connected world.
While there has been some interest in resuscitating data from Hadoop with MapReduce or SQL on Hadoop, those directions are changing. For example, Cloudera recently announced the One Data Platform Initiative, indicating a shift from MapReduce
this initiative will enable [Spark] to become the successor to Hadoop’s original MapReduce framework for general Hadoop data processing
With Spark’s capabilities for streaming and in-memory processing, we are likely to see a focus on those real-time workflows. This is not to say that Spark won’t be used to explore expansive historical data throughout Hadoop clusters.
But judge your own predilection for real-time and historical data. Yes, both are important, but human beings tend to have an insatiable desire for the now.
Data Warehousing is Poised for Refresh
When the last wave of data warehousing innovation hit mainstream, there was a data M&A spree that started with SAP’s acquisition of Sybase in May 2010. Within 10 months, Greenplum was acquired by EMC, Netezza by IBM, Vertica by HP, and Aster by Teradata.
Today, customers are suffering economically with these systems, which have become expensive to maintain and do not deliver the instant results companies now expect.
Applications like real-time dashboards push conventional data warehousing systems beyond their comfort zone, and companies are seeking alternatives.
Getting to ETL Zero
If there is a common enemy in the data market, it is ETL, or the Extract, Transform, and Load process. We were reminded of this when Riley Newman from Airbnb mentioned that
ETL was like extracting teeth…no one wanted to do it.
Ultimately, Riley did find a way to get it done by shifting ETL from a data science to a data engineering function (see final theme below), but I have yet to meet a person who is happy with ETL in their data pipeline.
ETL pain is driving new solution categories like Hybrid Transactional and Analytical Processing, or HTAP for short. In HTAP solutions, transactions and analytics converge on a single data set, often enabled by in-memory computing. HTAP capabilities are at the forefront of new digital applications with situational awareness and real-time interaction.
The Matrix Dashboard is Coming
Of course, all of these real-time solutions need dashboards, and dashboards need to be seen. Hiperwall makes a helpful solution to tie multiple monitors together in a single, highly-configurable screen. The dashboards of the future are here!
Read Post

Trending
Locate This! The Battle for App-specific Maps
In early August, a consortium of the largest German automakers including Audi, BMW, and Daimler (Mercedes) purchased Nokia’s Here mapping unit, the largest competitor to Google Maps, for \$3 billion.
It is no longer easy to get lost. Quite the opposite, we expect and rely on maps for our most common Internet tasks from basic directions to on-demand transportation, discovering a new restaurant or finding a new friend.
And the battle is on between the biggest public and private companies in the world to shore up mapping data and geo-savvy engineering talent. From there, the race continues to deliver the best mapping apps.
Recently a story on the talent war among unicorn private companies noted
Amid a general scramble for talent, Google, the Internet search company, has undergone specific raids from unicorns for engineers who specialize in crucial technologies like mapping.
Wrapping our planet in mobile devices gave birth to a new geographic landscape, one where location meets commerce and maps play a critical role. In addition to automakers like the German consortium having a stake in owning and controlling mapping data and driver user experiences, the largest private companies like Uber and Airbnb depend on maps as an integral part of their applications.
That is part of the reason purveyors of custom maps like Mapbox have emerged to handle mapping applications for companies like Foursquare, Pinterest, and Mapquest. Mapbox raised \$52.6 million earlier this summer to continue its quest.
Mapbox and many others in the industry have benefitted from the data provided by Open Street Maps, a collection of mapping data free to use under an open license. Of course, some of the largest technology companies in the world besides Google maintain their own mapping units, including Microsoft (Bing Maps) and Apple (Apple Maps).
Investment in the Internet of Things combined with mobile device proliferation is creating a perfect storm of geolocation information to be captured and put to use. Much of this will require an analytics infrastructure with geospatial intelligence to realize its value.
In a post titled, Add Location to Your Analytics, Gartner notes
The Internet of Things (IoT) and digital business will produce an unprecedented amount of location-referenced data, particularly as 25 billion devices become connected by 2020, according to Gartner estimates.
and more specifically
Dynamic use cases require a significantly different technology that is able to handle the spatial processing and analytics in (near) real time.
Of course geospatial solutions have been around for some time, and database providers often partner with the largest private geospatial company, Esri, to bring them to market. In particular, companies developing in-memory databases like SAP and SingleStore have showcased work with Esri. By combining the best in geospatial functions with real-time, in-memory performance, application makers can deliver app-specific maps with an unprecedented level of consumer interaction.
Google’s balloons and Facebook’s solar-powered drones may soon eliminate the dead zones from our planet, perhaps removing the word “lost” from our vocabulary entirely. Similarly, improvements in interior mapping technology guarantee location-specific details down to meters. As we head to this near-certain future, maps, and the rich, contextual information they provide, appear to be a secret weapon for delivering breakout application experiences.
Download SingleStore today to try a real-time database with native geospatial intelligence at: singlestore.com/free.
Read Post

Trending
Four Reasons Behind the Popularity and Adoption of In-Memory Computing
There is no question that data is infiltrating our world. Recently 451 Research predicted that the Total Data Market is expected to double in size from $60 billion in 2014 to $115 billion in 2019.
IDC suggested that Internet of Things (IoT) spending will reach \$1.7 trillion in 2020, and noted, “the real opportunity remains in the enterprise…”
And as stated in a recent Gartner blog post, while the three leading independent Hadoop distribution players measure their revenue in 10s of millions, commercial database vendors like Oracle, Microsoft, IBM, SAP and Teradata measure revenues in billions or 10s of billions of dollars in a \$33 billion market.
The data market is hot, and in-memory delivers the capabilities companies need to keep up. In the report Market Guide for In-Memory DBMS, published December 2014, analysts Roxane Edjlali, Ehtisham Zaidi, and Donald Feinberg outline the growing importance of in-memory.
Four Reasons for the popularity and adoption of In-Memory
Declining costs in memory and infrastructure
Server main memory (now called server-class memory) is expanding to sizes as high as 32TB and 64TB at an increasingly lower cost, thereby enabling new in-memory technologies such as IMDBMSs, because many applications’ working sets fit entirely into this larger memory. This rapid decline in the infrastructure and memory costs results in significantly better price/performance, making IMDBMS technology very attractive to organizations.
Growing importance of high-performance use cases
The growing number of high performance, response-time critical and low-latency use cases (such as real-time repricing, power grid rerouting, logistics optimization), which are fast becoming vital for better business insight, require faster database querying, concurrency of access and faster transactional and analytical processing. IMDBMSs provide a potential solution to all these challenging use cases, thereby accelerating its adoption.
Improved ROI promise
A cluster of small servers running an IMDBMS can support most or all of an organization’s applications, drastically reducing operating costs for cooling, power, floor space and resources for support and maintenance. This will drive a lower total cost of ownership (TCO) over a three- to five-year period and offset the higher total cost of acquisition from more expensive servers.
Improved data persistence options
Most IMDBMSs now offer features for supporting “data persistence,” that is the ability to survive disruption of their hardware or software environment. Techniques like high availability/disaster recovery (HA/DR) provide durability by replicating data changes from a source database, called the primary database, to a target database, called the standby database. This means that organizations can continue to leverage IMDBMS-enabled analytical and transactional use cases without worrying about prolonged system downtime or losing their critical data to power failures.
From Market Guide for In-Memory DBMS, Roxane Edjlali, Ehtisham Zaidi, Donald Feinberg, 9 December 2014
Download the Complete Report
If you’d like to read more on the state of the In-Memory DBMS market,
download the entire report here.
Read Post

Trending
Tech Field Day Reception May 13th
Tech Field Day is coming to San Francisco next week with a focus on data, and that means time for a party!

On Wednesday, May 13th, at 6:00pm in San Francisco, we will gather with industry participants and expert delegates from Tech Field Day for an evening of food, drinks, and engaging conversation about All Things Data!

Please RSVP for the evening reception.
Wednesday, May 13th, 6:00pm, 534 4th St, San Francisco.

About Tech Field Day

From big data to analytics to hyperscale architecture and cloud security, a new wave of innovation is transforming IT. The old way of doing things can’t keep up with the proliferation of data and micro-services essential to deliver services to mobile devices and the Internet-connected world. These “new stack” companies and technologies have a different audience from the traditional infrastructure. That’s why we created Data Field Day!

Agenda

6:00 – 7:00: Welcome reception, food and drinks (non-alcoholic too!)
7:00 – 8:00: Ignite presentations
8:00 – 9:00: More networking, food and drinks

The event will include all of the delegates from Tech Field Day listed below. Please RSVP for this free event and we hope to see you there.

Delegates
Read Post

Data Intensity
Driving Relevance with Real-Time and Historical Data
As technology weaves into our daily lives, our expectations of it continue to increase. Consider mobile devices and location information. Recently, 451 Research released data showing that 47% of consumers would like to receive personalized information based on their immediate location.
Read Post

Trending
Filling the Gap Between HANA and Hadoop
Takeaways from the Gartner Business Intelligence and Analytics Summit
Last week, SingleStore had the opportunity to participate in the Gartner Business Intelligence and Analytics Summit in Las Vegas. It was a fun chance to talk to hundreds of analytics users about their current challenges and future plans.
As an in-memory database company, we fielded questions on both sides of the analytics spectrum. Some attendees were curious about how we compared with SAP HANA, an in-memory offering at the high-end of the solution spectrum. Others wanted to know how we integrated with Hadoop, the scale-out approach to storing and batch processing large data sets.
And in the span of a few days and many conversations, the gap between these offerings became clear. What also became clear is the market appetite for a solution.
Hardly Accessible, Not Affordable
While HANA does offer a set of in-memory analytical capabilities primarily optimized for the emerging SAP S/4HANA suite, it remains at such upper echelons of the enterprise IT pyramid that it is rarely accessible across an organization. Part of this stems from the length and complexity of HANA implementations and deployments. Its top-of-the-line price and mandated hardware configurations also mean that in-memory capabilities via HANA are simply not affordable for a broader set of needs in a company.
Hanging with Hadoop
On the other side of the spectrum lies Hadoop, a foundational big data engine, but often akin to a large repository of log and event data. Part of Hadoop’s rise has been the Hadoop Distributed File System (HDFS) which allowed for cheap and deep storage on commodity hardware. MapReduce, the processing framework atop HDFS, powered the first wave of big data, but as the world moves towards real-time, batch processing remains helpful but rarely sufficient for a modern enterprise.
In-Memory Speeds and Distributed Scale
Between these ends of the spectrum lies an opportunity to deliver in-memory capabilities with an architecture on distributed, commodity hardware accessible to all.
The computing theme of this century is piles of smaller servers or cloud instances, directed by clever new software, relentlessly overtaking use-cases that were previously the domain of big iron. Hadoop proved that “big data” doesn’t mean “big iron.” The trend now continues with in-memory.
Moving To Converged Transactions and Analytics
At the heart of the in-memory shift is the convergence of both transactions and analytics into a single system, something Gartner refers to as Hybrid transactional/analytical processing (HTAP).
In-memory capabilities make HTAP possible. But data growth means the need to scale. Easily adding servers or cloud instances to a distributed solution lets companies meet capacity increases and store their highest value, most active data in memory.
But an all-memory, all-the-time solution might not be right for everyone. That is where combining all-memory and disk-based stores within a single system fits. A tiered architecture provides infrastructure consolidation and low-cost expansion for less active data.
Finally, ecosystem integration makes data pipelines simple, whether that includes loading directly from HDFS or Amazon S3, running a high-performance connector to Apache Spark, or just building upon a foundational programming language like SQL.
SQL-based solutions can provide immediate utility across large parts of enterprise organizations. The familiarity and ubiquity of the programming language means access to real-time data via SQL becomes a fast path to real-time dashboards, real-time applications, and an immediate impact.
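As a simple illustration, a real-time dashboard can be driven by nothing more exotic than a standard SQL aggregation over the freshest data. The query below is hypothetical, written against an assumed table of per-request usage records:

```sql
-- Hypothetical dashboard query: traffic per region over the last minute.
SELECT geographic_region,
       SUM(bytes) AS bytes_last_minute,
       SUM(hits)  AS hits_last_minute
FROM traffic
WHERE event_time >= NOW() - INTERVAL 1 MINUTE
GROUP BY geographic_region
ORDER BY bytes_last_minute DESC;
```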
Related Links:
To learn more, read How HTAP Remedies the Four Drawbacks of Traditional Systems here.
Want to learn more about in-memory databases and opportunities with HTAP? – Take a look at the recent Gartner report here.
If you’re interested in test driving an in-memory database that offers the full benefits of HTAP, give SingleStore a try for free, or give us a ring at (855) 463-6775.
Read Post