
Data Intensity
Real-Time Streaming, Analytics, and Visualization with SingleStore and Zoomdata
We regularly host meetups at SingleStore headquarters as a way to share what we have been working on, connect with the community, and get in-person feedback from people like you. We also invite partners and customers to join us. Recently, we had the pleasure of hosting a meetup with Zoomdata, where we shared two presentations on real-time streaming, analytics, and visualization using SingleStore and Zoomdata.
Read Post

Data Intensity
Delivering Scalable Self-Service Analytics
Within 48 hours of launching Google Analytics as a free product, virtually all of Google’s servers crashed. Eric Schmidt called this Google’s “most successful disaster.” Why would a free product, whose hardware requirements melted down a datacenter, be worth it? Wesley Chan, the creator of Google Analytics, later said that “Google Analytics generates about three billion dollars in extra revenue,” as noted in Steven Levy’s book, In The Plex. Google Analytics allowed Google’s customers to measure how well AdWords actually performed, and showed them, with their own data, exactly how much they could raise bids and still make money. As Chan said, “know more, spend more.”
The full potential of such an offering comes when customers are allowed to arbitrarily segment and calculate flexible aggregates. To do that, they need to be able to query raw, unaggregated data. That way the company does not have to guess what the customer wants to aggregate, because all choices remain available. If raw data access is not provided, then data must be precomputed, at least along some dimensions, which limits flexibility and the depth of insight users can get from the data.
Cost and technology constraints have led most companies to build customer-facing analytics with this precompute approach, because they need to serve analytics to many customers concurrently. The scale required to offer raw data access remained untenable: it was unthinkable to perform computations over billions of raw rows per request, concurrently, for thousands of customers.
Today, SingleStore is changing that conventional wisdom by offering companies the ability to serve analytics on raw, unaggregated data to a broad range of customers.
To explain this capability further, there are three major pieces of technology in this use case:
Scale-out
Columnstore query execution
Efficient data isolation
Scale-Out
Knowing system performance characteristics on a per-core basis, users can calculate how much compute and storage is needed to serve analytics at scale. Once that calculation is done, the key is to use a distributed system that provides enough dedicated compute power to meet demand. SingleStore can run on anywhere from one to hundreds of nodes, which lets users scale performance appropriately.
For example, if you have a million customers with one million data points each, you have one trillion data points. Imagine that at the peak, one thousand of those customers are looking at the dashboard simultaneously – essentially firing off one thousand concurrent queries against the database. Columnstore compression can store these trillion rows on a relatively small SingleStore cluster of approximately 20 nodes. Conservatively, SingleStore can scan 100 million rows per second per core, which means that just one core can service 100 concurrent queries scanning one million rows each, and deliver sub-second results for analytical queries over raw data – below we provide a benchmark of columnstore query execution performance.
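As a quick sanity check on that back-of-envelope arithmetic (the figures below are the assumptions from the paragraph above, not measurements), the math can be expressed directly in SQL:
-- 1 million customers x 1 million data points each = 1 trillion rows
SELECT 1000000 * 1000000 AS total_rows;
-- 100 million rows scanned per second per core / 1 million rows per query
-- = 100 one-million-row queries per core per second, i.e. sub-second responses
-- for 100 concurrent queries on a single core
SELECT 100000000 / 1000000 AS queries_per_core_per_second;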
Columnstore Query Execution
A simple query over a columnstore table, such as a `GROUP BY`, can run at a rate of hundreds of millions to over a billion data points per second per core.
To demonstrate this, we loaded a public dataset about every airline flight in the United States from 1987 until 2015. As the goal was to understand performance per core, we loaded this into a single node SingleStore cluster running on a 4 core, 8 thread Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz.
To repeat this experiment, download the data using the following bash script:
mkdir csv
for s in `seq 1987 2015`
do
  for m in `seq 1 12`
  do
    # Download each monthly on-time performance archive into the csv directory
    wget -P csv "http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_${s}_${m}.zip"
  done
done
# Extract the archives so LOAD DATA can read the CSV files
cd csv && unzip '*.zip'
Create this table:
CREATE TABLE ontime (
Year INT,
Quarter INT,
Month INT,
DayofMonth INT,
DayOfWeek INT,
FlightDate Date,
UniqueCarrier Varchar(100),
AirlineID INT,
Carrier Varchar(100),
TailNum Varchar(100),
FlightNum Varchar(100),
OriginAirportID INT,
OriginAirportSeqID INT,
OriginCityMarketID INT,
Origin Varchar(100),
OriginCityName Varchar(100),
OriginState Varchar(100),
OriginStateFips Varchar(100),
OriginStateName Varchar(100),
OriginWac INT,
DestAirportID INT,
DestAirportSeqID INT,
DestCityMarketID INT,
Dest Varchar(100),
DestCityName Varchar(100),
DestState Varchar(100),
DestStateFips Varchar(100),
DestStateName Varchar(100),
DestWac INT,
CRSDepTime INT,
DepTime INT,
DepDelay INT,
DepDelayMinutes INT,
DepDel15 INT,
DepartureDelayGroups Varchar(100),
DepTimeBlk Varchar(100),
TaxiOut INT,
WheelsOff INT,
WheelsOn INT,
TaxiIn INT,
CRSArrTime INT,
ArrTime INT,
ArrDelay INT,
ArrDelayMinutes INT,
ArrDel15 INT,
ArrivalDelayGroups INT,
ArrTimeBlk Varchar(100),
Cancelled INT,
CancellationCode Varchar(100),
Diverted INT,
CRSElapsedTime INT,
ActualElapsedTime INT,
AirTime INT,
Flights INT,
Distance INT,
DistanceGroup INT,
CarrierDelay INT,
WeatherDelay INT,
NASDelay INT,
SecurityDelay INT,
LateAircraftDelay INT,
FirstDepTime Varchar(100),
TotalAddGTime Varchar(100),
LongestAddGTime Varchar(100),
DivAirportLandings Varchar(100),
DivReachedDest Varchar(100),
DivActualElapsedTime Varchar(100),
DivArrDelay Varchar(100),
DivDistance Varchar(100),
Div1Airport Varchar(100),
Div1AirportID INT,
Div1AirportSeqID INT,
Div1WheelsOn Varchar(100),
Div1TotalGTime Varchar(100),
Div1LongestGTime Varchar(100),
Div1WheelsOff Varchar(100),
Div1TailNum Varchar(100),
Div2Airport Varchar(100),
Div2AirportID INT,
Div2AirportSeqID INT,
Div2WheelsOn Varchar(100),
Div2TotalGTime Varchar(100),
Div2LongestGTime Varchar(100),
Div2WheelsOff Varchar(100),
Div2TailNum Varchar(100),
Div3Airport Varchar(100),
Div3AirportID INT,
Div3AirportSeqID INT,
Div3WheelsOn Varchar(100),
Div3TotalGTime Varchar(100),
Div3LongestGTime Varchar(100),
Div3WheelsOff Varchar(100),
Div3TailNum Varchar(100),
Div4Airport Varchar(100),
Div4AirportID INT,
Div4AirportSeqID INT,
Div4WheelsOn Varchar(100),
Div4TotalGTime Varchar(100),
Div4LongestGTime Varchar(100),
Div4WheelsOff Varchar(100),
Div4TailNum Varchar(100),
Div5Airport Varchar(100),
Div5AirportID INT,
Div5AirportSeqID INT,
Div5WheelsOn Varchar(100),
Div5TotalGTime Varchar(100),
Div5LongestGTime Varchar(100),
Div5WheelsOff Varchar(100),
Div5TailNum Varchar(100),
key (AirlineID) using clustered columnstore
);
Then load data into the table:
`load data infile '/home/memsql/csv/*' into table ontime fields terminated by ',' enclosed by '"' lines terminated by ',\n' ignore 1 lines;`
Once the data is loaded, run a simple GROUP BY query. The following query performs a full table scan:
SELECT OriginCityName, count(*) AS flights
FROM ontime GROUP BY OriginCityName ORDER BY flights DESC LIMIT 20;
On a machine with 4 cores, a query over this 164-million-row dataset runs in 0.04 seconds, which works out to roughly one billion rows per second per core. No, that’s not a typo. That’s a billion rows per second per core. More complex queries will consume more CPU cycles, but with this level of baseline performance there is a lot of room across a cluster of 8, 16, or even hundreds of machines to handle multi-billion row datasets with response times under a quarter of a second. At that speed, queries appear to be instantaneous to users, leading to great user satisfaction.
Try this example using SingleStoreDB Self-Managed 6. New vectorized query execution techniques in SingleStoreDB Self-Managed 6, using SIMD and operations directly on encoded (compressed) data, make this speed possible.
Efficient Data Isolation Per Customer
Data warehouses such as Redshift and BigQuery support large scale, but may not sufficiently isolate queries from one another in highly concurrent workloads. On top of that, both have substantial fixed overhead per query. Redshift in particular does not support many concurrent queries: http://docs.aws.amazon.com/redshift/latest/dg/cm-c-defining-query-queues.html.
Depending on the analytical requirements, SingleStore allows for an ordered and partitioned physical data layout that ensures each query scans only the data belonging to a single customer. In our example, the columnstore was clustered on AirlineID.
SingleStore supports clustered columnstore keys, which allow global sorting of columnstore tables. In this case, a query with a predicate on AirlineID scans only the subset of data belonging to that airline. This allows SingleStore to deliver very high concurrency (in the thousands of concurrent queries) with each query scanning and aggregating millions of data points.
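To make this concrete, here is the kind of per-customer query that layout serves against the ontime table above; the AirlineID value is just an illustrative placeholder for a single customer (airline):
SELECT OriginCityName, COUNT(*) AS flights
FROM ontime
WHERE AirlineID = 19805
GROUP BY OriginCityName
ORDER BY flights DESC
LIMIT 20;
With the table globally sorted on AirlineID, this query scans only the rows for that single airline instead of the whole table.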
More on Query Execution
At SingleStore, we are continuously innovating with new query processing capabilities. This is a list of recent innovations in our shipping product: https://archived.docs.singlestore.com/v6.0/release-notes/memsql/60-release-notes/.
Bringing it All Together
Going back to our original example: though our dataset is one trillion rows, because of the clustered columnstore key each customer only needs to scan through one million rows. For a simple query like the one above, scanning 500 million rows per second per core means that a single CPU core could support 500 concurrent queries and deliver sub-second performance.
To recreate the work mentioned in this blog, try out SingleStoreDB Self-Managed 6: singlestore.com/free.
Read Post

Data Intensity
Key Considerations for a Cloud Data Warehouse
Data growth and diversity have put new pressures on traditional data warehouses, resulting in a slew of new technology evaluations. The data warehouse landscape offers a variety of options, including popular cloud solutions with pay-as-you-go pricing in a package that is easy to use and scale. Here are some considerations to help you select the best cloud data warehouse.
First, Identify Your Use Case
A cloud data warehouse supports numerous use cases for a variety of business needs. Here are some common use cases along with the notable capabilities required for each.
Ad Hoc Analysis
Ad hoc analysis provides guided or open queries to the data warehouse, giving the end user flexibility to explore deeper questions. Users work in native SQL or in an interactive visual analysis tool such as Tableau or Looker. Each query result often prompts the user to dive further into the data, going from summary or aggregate views into distinct row-level detail. A data warehouse that is good at ad hoc analysis delivers fast, consistent responses across a variety of query types.
How does a data warehouse support ad hoc analysis?
Efficient query processing that can scan, join, and aggregate data in a variety of table structures.
Columnstore table format for optimized disk usage and accelerated aggregate query response.
Relational data format with ANSI SQL query syntax provides a familiar, easy-to-use structured language.
Built-in statistical functions such as MAX, MIN, SUM, COUNT, STD, NTILE, and RANK, to name a few, will make it easier to build sophisticated queries (see the example query below).
Data security ensures different users are shielded from sensitive or unauthorized data, requiring user authentication, role-based access control, and row-level security.
Scalable concurrency for supporting thousands of users running a variety of queries simultaneously.
Native connectivity to leading business intelligence tools for easier visual analysis and collaborative dashboards.
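As one illustration of how those built-in functions come together in ad hoc analysis, here is a sketch of a query against a hypothetical page_views table (the table and column names are ours, not from any particular product):
SELECT region,
       COUNT(*) AS views,
       SUM(revenue) AS total_revenue,
       RANK() OVER (ORDER BY SUM(revenue) DESC) AS revenue_rank,
       NTILE(4) OVER (ORDER BY SUM(revenue) DESC) AS revenue_quartile
FROM page_views
WHERE view_date >= '2018-01-01'
GROUP BY region
ORDER BY revenue_rank;
A result like this lets an analyst move from a summary view (revenue rank and quartile by region) straight into row-level detail by adding a WHERE clause on the region of interest.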
Machine Learning and Data Science
Data science and machine learning use a data warehouse to identify trends, discover hidden data relationships, and predict future events with sophisticated algorithms. Machine learning is a technique that can learn and improve insight discovery without explicitly being programmed to do so. Data scientists will often require large volumes of data to improve their predictions and correlations. Data is often enriched and cleaned or packaged into sample data sets for faster experimentation. Experiments are commonly performed offline due to the intense processing power required for the analysis. Advances in algorithms, hardware, machine learning and artificial intelligence tooling have led to more advanced data processing methods that can automatically identify hard to find events with relatively little human coordination.
How does a data warehouse support machine learning and data science?
Support a variety of data types including relational, CSV, JSON, and geospatial formats.
Provide native interoperability with data preparation and statistical tooling, such as Spark, SparkML, Python, R, SAS, and TensorFlow.
To maximize resource savings, offer rapid sandbox configuration for quick experimentation with easy spin-up and termination of databases as load requirements change.
To support collaboration and sharing of analyses, offer native connectivity with modern business intelligence tools such as Tableau, Zoomdata, and Looker.
Real-Time and Operational Analytics
Operational analytics applications often manage Key Performance Indicators (KPIs) by querying data continuously. The insights might be used several times a day by people or machines. The speed of response for an operational or real-time analytics solution can vary based on the systems in place and the organizational readiness. Gartner’s Roy Schulte said it best in his report, How to Move Analytics to Real Time:
“Business real time is about situation awareness; sensing and responding to what is happening in the world now, rather than to what happened a few hours or days ago, or what is predicted to happen based on historical data.”
How does a data warehouse support real-time analytics?
Streaming ingestion of data that can be immediately queried.
Fast processing of repeat queries, potentially by thousands of users or applications.
To reduce outages and maintain 24/7 operational support, high availability that includes redundancy and auto-failover.
To improve accuracy and decision speeds, exactly-once semantics for real-time data de-duplication and enrichment.
Mixed Workload Analytics
Most organizations want a single source of data to improve decision accuracy and support a variety of workloads across ad hoc, machine learning, and real-time analytics. These expanded use cases place a strong emphasis on performance, security, and user or application concurrency. Due to the variety of applications requiring sub-second data access, mixed workloads can be a challenge to tune and govern.
How does a data warehouse support mixed workload analytics?
A robust, efficient, and distributed query processor that can support a broad range of queries without overpaying for extra hardware resources or requiring hard-to-manage database configurations.
Rapid, easy-to-scale architecture that can address changes in workload complexity and user concurrency load.
Comprehensive security to shield users from seeing sensitive data without requiring custom database schemas or views.
Broad data ingestion to support real-time streaming and batch load requirements.
Next Up, Understanding Cloud Data Warehouse Capabilities
As you evaluate your next cloud data warehouse investment, it’s important to know the range of capabilities that are important for your project or business. Below is a list of capabilities organized by category to help you identify the right data warehouse:
Usability
Rapid provisioning: Setup should be self-service and take a few minutes from the point of sign up to a running, functioning database
Accessibility: For easy query processing and integration with existing applications, tools, and skills, the environment should support relational data using ANSI SQL
Easy data loading: A guided or integrated data loading process should give users an easy, integrated way to deploy a real-time data pipeline or bulk load ingestion
Optimized query processing: The database should have a distributed query optimizer that can process most queries with minimal specialized tuning
Simplified capacity management: As data or user growth expands, the data warehouse should provide managed or automated capacity adjustment to quickly address changing workloads
Performance
Ingest to analysis: Streaming data ingestion with simultaneous query processing ensures the fastest possible insights on live and historical data
Fast queries: Sub-second query response against billions of rows with vectorized query processing and columnstore structure for ad hoc dashboards or operational reports
Operationally tuned: Compiled SQL queries accelerate query execution for added performance gains
Cost
On-demand pricing: Sometimes a data warehouse is not required for 24/7 operation; hourly billing can tightly associate usage to payment
Annual discounts: Reserved pricing discounts should be an option for operational deployments that are always available
Flexibility
Multicloud: To maximize the proximity of your data and get the ultimate performance for your applications, you need the freedom to choose the cloud service provider you prefer or have standardized on
Hybrid cloud: Maintain existing investments by spanning data warehouse investments across on-premises and cloud on a single platform
Elastic: Driven by growth in data, users, or query sophistication, rapidly scale out or down for new capacity requirements
Interoperable: To ensure compatibility with existing tools, applications, and skills, support JDBC/ODBC connectivity, MySQL wire protocol, and ANSI SQL
Scalability
Concurrency support: Scale-out distributed architecture ensures that high volume ingest and write queries do not degrade dashboard or report performance
High availability: Efficient replication and distributed architecture ensures no single point of failure for operational requirements
Durable: All data should reside on disk for audit or regulatory requirements along with expedited recovery from unexpected failures
Security
Comprehensive: Data should be secured across the analysis lifecycle: single sign-on (SSO) authentication, role-based access control (RBAC), SSL encryption of data over the wire, encryption of data at rest, granular audit logging, and separation of concerns for database administrators
Consistent: Ensure a consistent security model from on-premises to the cloud, with strong security capabilities across deployments
Conclusion: Considerations for SingleStoreDB Cloud
SingleStoreDB Cloud offers all the capabilities described above as a full-featured cloud data warehouse that is easy to set up and use, supporting a mix of workloads on a single integrated platform. The product delivers a fast, flexible, and secure environment that is capable of analyzing both live and historical data. The pay-as-you-go service gives organizations an affordable approach to real-time analytics. Try SingleStoreDB Cloud today and get a $300 free credit offer, good for up to 300 hours of free usage.
Try SingleStoreDB Cloud Now
Read Post

Data Intensity
AI’s Secret Weapon: The Data Corpus
Modern businesses are increasingly geared toward generating customer value by making decisions and predictions with demanding technologies like Artificial Intelligence (AI), Natural Language Processing (NLP), and Machine Learning (ML). Naturally, the demand for data is growing rapidly.
To train highly accurate machine learning models, extensive amounts of data, known as a data corpus, are essential. This guide introduces the data corpus concept and its importance to ML and AI, explains how data corpus and machine learning capabilities are adopted by databases like MySQL (and MySQL wire-compatible databases), and shows how SingleStoreDB, as a MySQL wire-compatible database, leads the market in demanding machine learning and artificial intelligence capabilities with extensive support for the data corpus.
The Corpus of Data
Read Post

Data Intensity
Seeking a Rescue from a Traditional RDBMS
In the Beginning
Years ago, organizations used transactional databases to run analytics. Database administrators struggled to set up and maintain OLAP cubes or tune report queries. Monthly reporting cycles would slow or impact application performance because all the data was in one system. The introduction of custom-hardware, appliance-based solutions helped mitigate these issues; the resulting products were transactional databases with fast columnstore engines. Stemming from these changes, several data warehouse solutions sprang up from Oracle, IBM Netezza, Microsoft, SAP, Teradata, and HP Vertica, but these data warehouses were designed for the requirements of 20 years ago. Thus new challenges arose, including:
Ease of use – each environment required specialty services to set up, configure, and tune
Expense – initial investments were high, and adding capacity required further spending
Scalability – performance was designed around single-box configurations; the larger the box, the faster the data warehouse
Batch ingestion – inability to store and analyze streaming data in real time
As new data or user requests landed on the system, database administrators (DBA) had to scale the system up from a hardware perspective. Need more scale? Buy more boxes! DBAs became tired of having to buy new expensive hardware every time their query was slow or every time they had to ingest from a new data source.
An Explosion of Data
The data warehouse appliance was passable back then, but today, new workloads and data growth have put a strain on traditional solutions to the point where many users are seeking rescue from the clutches of incumbent systems. Explosive data growth due to web and mobile application interactions, customer data, machine sensors, video telemetry, and cheap storage means customers are storing “everything,” which has added to the strain on traditional systems. Real-time application data, now pervasive in digital business, along with new machine- and user-generated data, puts increasing pressure on ingestion and query performance requirements. Real-world examples include the digital personalization required of retailers, customer 360 programs, real-time IoT applications, and real-time logistics applications. To address the increased strain, there has been a strategic shift to cloud and distributed systems for agility and cost optimization.
Read Post

Data Intensity
Real-Time Data Warehousing for the Real-Time Economy
In the age of manual decision making based on predictable data formats, data feeds, and batch processing times, enterprise businesses stayed current with ad hoc analyses and periodic reports. To generate analyses and reports, businesses relied on the traditional data warehouse. Using extraction, transformation, and load batch processes, the traditional data warehouse standardized disparate data into normalized schemas and pre-computed cubes. With the data shaped into pre-configured dimensions and aggregated facts, enterprises made historical data and analyses available to front-line decision makers.
In today’s age of digital business, machine learning, and artificial intelligence, decision making for companies is now very different. It is based on unpredictable data formats, massive data scale, event messaging, stream processing, model training, historical analyses, predictive analytics, and real-time dashboards.
The Enterprise Shift to Real-Time
Today’s real-time economy is segmented, scored, personalized, appified, monetized, and data fueled. Aligning today’s enterprise business with the on-demand, digital economy requires real-time data storage and analytics.
“Companies like Pinterest, Uber, and Pandora have achieved significant performance advantages by shifting from a batch process to a continual load of data.” – Eric Frenkiel, SingleStore co-founder and CEO
The real-time data warehouse embraces perpetual data ingest, simultaneous reads, high user concurrency, and fast queries. This new data warehouse continually loads and transforms data, one transaction per event or message with exactly-once semantics. The real-time data warehouse is the foundation for today’s enterprise business that is synchronized with the market and its profit-making opportunities.
The Real-Time Enterprise Gets Its Data Warehouse
In a SingleStore data warehouse, where there is continual data ingest at massive scale, disparate business data is made whole for the various applications running on top of the data warehouse.
“For many enterprises today, moving to real time is the next logical step, and for some others it’s the only option.” – Nikita Shamgunov, SingleStore co-founder and CTO
Data analysts run real-time ad hoc analyses in sub-seconds. Data scientists train and evaluate machine learning models within minutes. Apps predictively score and deliver personalized experiences. Armed with a customer 360-degree view of critical business metrics, front-line decision makers automate actions that drive value.
Where Real-Time Data Warehousing Matters
In today’s market, enterprises need the flexibility to migrate, expand, and burst their storage and compute capacity both on-premises and in the cloud.
“SingleStore is the most flexible data warehouse in the market today.” – Eric Frenkiel, SingleStore co-founder and CEO
With SingleStore, businesses push real-time workloads to wherever it is most economical to run them. For companies succeeding in the real-time economy, SingleStore is the hybrid cloud data warehouse that offers these critical operational and cost-savings advantages.
Learn more about how and where real-time data warehousing matters from SingleStore co-founders, Eric Frenkiel (CEO) and Nikita Shamgunov (CTO), in this insightful video.
Read Post

Data Intensity
Video: Scoring Machine Learning Models at Scale
At Strata+Hadoop World, SingleStore Software Engineer John Bowler shared two ways of making production data pipelines in SingleStore:
1) Using Spark for general purpose computation
2) Through a transform defined in a SingleStore pipeline for general purpose computation
In the video below, John runs a live demonstration of SingleStore and Apache Spark for entity resolution and fraud detection across a dataset composed of a hundred thousand employees and fifty million customers. John uses SingleStore and writes a Spark job along with an open source entity resolution library called Duke to sort through and score combinations of customer and employee data.
SingleStore makes this possible by reducing network overhead through the SingleStore Spark Connector along with native geospatial capabilities. John finds the top 10 million flagged customer and employee pairs across 5 trillion possible combinations in only three minutes. Finally, John uses SingleStore Pipelines and TensorFlow to write a machine learning Python script that accurately identifies thousands of handwritten numbers after training the model in seconds.
Read Post

Data Intensity
The Analytics Race Amongst The World’s Largest Companies
The Analytics Race Amongst The World’s Most Valuable Companies
Read Post

Data Intensity
Real-Time and The Rise of Nano-Marketing
The tracking and targeting of our online lives is no secret. Once we browse to a pair of shoes on a website, we are reminded about them in a retargeting campaign. Lesser-known efforts happen behind the scenes to accumulate data and scan through it in real time, delivering the perfect personalized campaign. Specificity and speed are converging to deliver nano-marketing.
If you are a business leader, you’ll want to stay versed in these latest approaches. If not, as a consumer, you’ll likely want to understand how brands are applying their craft to you personally.
Brands seek specific customer interactions. If you sign up for a retailer’s newsletter, you might receive a preferences questionnaire so they can tailor everything to your specific wants or needs.
But speed also matters, as many of the largest marketing-driven industries like fashion, TV, movies, and music, depend on relevancy in the moment. Being current is currency itself. Only through real-time interaction can this be achieved.
Looking ahead, leaders of digital initiatives will expand their focus from today’s notion of personalized marketing to “nano-marketing,” using tools to predict granular audience cohorts on the fly and prescribe individualized marketing experiences in real time. Brands will be able to improve the customer experience directly through context, by individual interaction, and instantaneously.
For example, when you walk into a furniture showroom where you also have an online account, the sales representative should know what you were searching for before you arrived, even if it was just a few hours ago. And they should have easy access to your Pinterest page if you’ve made that public. These are the types of experiences we can expect in the future with nano-marketing.
Behind nano-marketing, taking personalized marketing to the next level
The concept behind personalized marketing is hardly new. Brands have always strived to create special experiences for customers in order to entice them to return. With the creation of Customer Relationship Management (CRM) systems and the proliferation of social media, this idea has become even more popular.
Marketing to customer segments of one merges existing and new disciplines of the trade. The low bar for what currently qualifies as “personalized marketing” will soon rise with the advent of tools that allow finer granularity, faster.
Looking ahead, we can expect three areas of marketing innovation:
The Autonomous Marketing Stack
Marketers have a plethora of tools available across infrastructure and analytics, including platforms like Salesforce.com, Marketo, Eloqua, Omniture, Google Analytics, and dozens of more specialized offerings. Truthfully, the availability of special-purpose tools has outstripped the individual’s ability to integrate them.
In the coming years, we’ll move far beyond just cobbling together the tools that help us be more efficient and cater to our customers; we’ll have a marketing tool stack that implements and executes campaigns on its own.
Imagine a system that watches social feeds for popular items, aggregates existing content to resurface it into the discussion, and kicks off a set of new content assets to carry the conversation forward. And this happens between Friday and Sunday with little human effort.
Virtual Reality Is The New Content
Today marketers often focus on generating a considerable amount of written content. Tomorrow they will put the pen down and focus on virtual experiences for customers that allow them to interact with content in ways not possible before. With attention spans getting shorter, and the firehose of new content bombarding customers, brands will need to focus on things that don’t just inform, but also entertain.
Whereas today an automobile company might customize regional billboards to fit with the landscape, soon they will offer tailored virtual reality experiences in a city and driving venue of your choice.
Where The Real-Time Meets The Road
Finally, all of this will come together in the insatiable pursuit of instant gratification. Not only will consumers not be surprised by real-time results, they will come to demand them. To stay on top, marketers, and the tools they use, will need to absorb, process, and contextualize information more quickly to deliver unique interactive experiences. This is already happening in areas like ad tech and finance, but stay tuned as the latest in real-time technologies work their way across all industries.
Read Post

Data Intensity
From Big to Now: The Changing Face of Data
Data is changing. You knew that. But the dialog of the past 10 years around big data and Hadoop is rapidly shifting to one about real-time data.
We have tackled how to capture big data at scale. We can thank the Hadoop Distributed File System for that, as well as cloud object stores like AWS S3.
But we have not yet tackled the instant results part of big data. For that we need more. But first, some history.
Turning Point for the Traditional Data Warehouse
Internet scale workloads that emerged in the past ten years threw the traditional data warehouse model for a loop. Specifically, the last generation of data warehouses relied on
Scale-up models; and
Appliance approaches
Vast amounts of internet and mobile data have made these prior approaches uneconomical, and today customers suffer the high cost of conventional data warehouses.
Read Post

Data Intensity
An Engineering View on Real-Time Machine Learning
About Thorn
Thorn partners across the tech industry, government and NGOs, building technology to combat predatory behavior, identify victims, and protect vulnerable children.
About Eric Boutin
Eric leads an engineering team for SingleStore in our Seattle office. This is background information from Eric on our work with Thorn.
How did you first get connected with Thorn?
I was introduced to Federico Gomez Suarez, a volunteer working with Thorn, by a mutual friend. I was impressed by the work Thorn was doing, and excited about the opportunity to help them.
What specific technical challenges did you see as opportunities?
Thorn was working on face recognition and machine learning to analyze pictures on the internet to protect vulnerable children. The main technical challenge they had, however, was matching the fingerprint of a picture against the fingerprints of an extremely large number of other pictures. Thorn needs to match a very large number of pictures per second, all in real time, against a gigantic database of pictures that is constantly being updated.
What connections were you able to draw to SingleStore capabilities?
The fingerprint matching problem seemed like a natural match for SingleStore. The dataset is too large to fit on one machine, and very high parallelism is required to match pictures in real time. While the process of extracting the fingerprint out of an image is extremely complex, the process of matching fingerprints consists of linear operations on vectors. The difficulty here is the vast mountain of changing data that has to be processed in real time, and to me, this looked a lot more like a database problem than just a machine learning problem. More specifically, it seemed like a perfect use case for SingleStore.
Did you have to develop anything for SingleStore so Thorn could succeed?
Overall, the distributed and parallel architecture of SingleStore was a natural fit for the problem that Thorn needed to solve. The only gap was the ability to do linear algebra operations on vectors in order to match image fingerprints. I added database operators to perform the required linear algebra operations. Given the steep performance requirement, I used the AVX2 instruction set to implement the linear algebra operations and minimize latency. A few hours later I was able to test real-time fingerprint matching at scale.
What improvements were made possible for Thorn by using SingleStore?
When we started the project, they didn’t have a solution to the problem of matching image fingerprints. Thorn was investigating a number of approaches, but they had not yet found one that would match image fingerprints in real time. These improvements are enabling them to move forward with the project, which will in turn protect children more effectively.
How might this work apply to other use cases or industries?
The key insight from this project is that by adding basic linear algebra operations to the SQL language, any machine learning system using models that can be evaluated with linear algebra (logistic regression, linear regression, k-means, or k-NN using Euclidean or cosine distance) could be evaluated directly in a SQL query. For example, click-through rate prediction is a machine learning problem where a website is trying to predict which advertisement has the highest probability of being clicked on. The problem can be modeled as a linear regression between one user and a large number of ads, and the ad with the highest probability of being clicked on is picked. Logistic regression actually consists of a simple dot product between the ‘ad’ vector and the ‘user’ vector followed by a few scalar operations. We can imagine applications where the same database is being used for click-through prediction as well as business intelligence and analytics on the real-time stream of clicks and impressions. In a few lines of SQL the user could express “select the ad with the highest predicted click-through rate for a given user, from an advertiser that still has enough budget.” In the same transaction, the application could then deduct money from the advertiser’s budget to account for the impression.
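As a rough sketch of what that could look like in SQL – the schema, values, and availability of a vector function such as DOT_PRODUCT are our assumptions for illustration, not a description of Thorn’s system or of any specific product release:
-- Hypothetical schema: users(user_id, features), ads(ad_id, advertiser_id, features),
-- advertisers(advertiser_id, budget); features are packed float vectors.
-- Pick the ad with the highest predicted score for one user, limited to
-- advertisers that still have budget.
SELECT a.ad_id,
       DOT_PRODUCT(u.features, a.features) AS score
FROM users u
CROSS JOIN ads a
JOIN advertisers adv ON adv.advertiser_id = a.advertiser_id
WHERE u.user_id = 42
  AND adv.budget > 0
ORDER BY score DESC
LIMIT 1;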
What do you see next in terms of new innovations in this arena?
I would like this field to innovate from two different directions. On the one hand, I would like to see databases support more and more algebra primitives to allow expressing more complex machine learning models – for example, supporting matrices, vector/matrix operators, aggregation across multiple vectors, and so on. This would allow a growing number of machine learning algorithms to be expressed in SQL. Even neural networks can be expressed as a sequence of scalar and vector operations. On the other hand, I would like to see machine learning frameworks ‘push down’ algorithms into databases using SQL. Today, business intelligence tools commonly push down joins and filtering into databases to leverage their high performance query processing engines. We could see machine learning frameworks push down parts of an algorithm (computing the gradient of the error, for example) as a SQL query in the database engine to process data more effectively at scale.
Read Post

Data Intensity
Video: Building the Ideal Stack for Real-Time Analytics
Building a real-time application starts with connecting the pieces of your data pipeline.
To make fast and informed decisions, organizations need to rapidly ingest application data, transform it into a digestible format, store it, and make it easily accessible. All at sub-second speed.
A typical real-time data pipeline is architected as follows:
Application data is ingested through a distributed messaging system to capture and publish feeds.
A transformation tier is called to distill information, enrich data, and deliver the right formats.
Data is stored in an operational (real-time) data warehouse for persistence, easy application development, and analytics.
From there, data can be queried with SQL to power real-time dashboards.
As new applications generate increased data complexity and volume, it is important to build an infrastructure for fast data analysis that enables benefits like real-time dashboards, predictive analytics, and machine learning.
At this year’s Spark Summit East, SingleStore Product Manager Steven Camina shared how to build an ideal technology stack to enable real-time analytics.
Video: Building the Ideal Stack for Real-Time Analytics
Read Post

Data Intensity
Turning Amazon S3 Into a Real-Time Analytics Pipeline
SingleStoreDB Self-Managed 5.7 introduces a new pipeline extractor for Amazon Simple Storage Service (S3). Many modern applications interface with Amazon S3, storing data objects of up to 5TB in buckets, providing a modern approach for today’s enterprise data lake.
Without analytics, the data is just a bunch of files
For modern enterprise data warehouses, the challenge is to harness the unlimited nature of S3 for ad-hoc and real-time analytics. For traditional data warehouse applications, extracting data from S3 requires additional services and background jobs that monitor buckets for new objects and then load those objects for reporting and analysis. Eliminating duplicates, handling errors, and applying transformations to the retrieved objects often requires extensive coding, middleware, or additional Amazon offerings.
From data lake to real-time data warehouse
A SingleStore S3 Pipeline extracts data from a bucket’s objects, transforms the data as required, and loads the transformed data into columnstore and rowstore tables. SingleStore Pipelines use the power of distributed processing and in-memory computing to extract, transform, and load external data in parallel into each database partition, with exactly-once semantics.
To stream existing and new S3 objects while querying the streaming data with sub-second performance, a SingleStore S3 Pipeline runs perpetually. Rapid and continuous data ingest for real-time analytic queries is a native component of SingleStore. The constant data ingest allows you to deliver real-time analytics with ANSI SQL and power business intelligence applications like Looker, Zoomdata, or Tableau.
SingleStore Pipelines are first-class database citizens. Database developers and administrators can easily create, test, alter, start, stop, and configure pipelines with basic data definition language (DDL) statements, or use a graphical user interface (GUI) in SingleStore Ops.
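As a sketch of what that DDL looks like – the bucket name, table, region, and credentials below are placeholders of our own, not values from the quickstart:
CREATE PIPELINE s3_events
AS LOAD DATA S3 'my-bucket/events/'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "YOUR_ACCESS_KEY_ID", "aws_secret_access_key": "YOUR_SECRET_ACCESS_KEY"}'
INTO TABLE events
FIELDS TERMINATED BY ',';
START PIPELINE s3_events;
Once started, the pipeline runs perpetually, loading new objects from the bucket into the events table as they appear.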
Excited to get started with SingleStore S3 Pipelines? Follow these steps:
Open an AWS account. AWS offers a Free Tier that includes 5 GB of Amazon S3 storage, 20,000 GET requests, and 2,000 PUT requests.
Download a 30-day free trial of the SingleStore Enterprise Edition or use the SingleStore Official Docker Image to run SingleStore.
With an available cluster running, create your first SingleStore S3 Pipeline using our S3 Pipelines Quickstart. The guide covers creating S3 buckets, a SingleStore database, and most importantly, a SingleStore S3 Pipeline.
Read Post

Data Intensity
How to Move Analytics to Real-Time
3x Spend Increase
“Between 2016 and 2019, spending on real-time analytics will grow three times faster than spending on non-real-time analytics.”
Every organization uses some form of analytics to monitor and improve their business. The growth of data has increased the impact of analytics and is a critical ingredient for delivering a successful digital business strategy.
Companies are using more real-time analytics, because of the pressure to increase the speed and accuracy of business processes – particularly for digital business and the Internet of Things (IoT).
– How to Move Analytics to Real Time, by W. Roy Schulte, Gartner, September 2016
In the Gartner report “How to Move Analytics to Real Time,” by W. Roy Schulte, there are several recommendations to guide your data management evolution from historical analytics to a real-time system.
Relevance of Real-Time for your Business
Determining the right “Real-Time” approach for your business is an important first step, ensuring the objectives of the solution are aligned with the business outcome. In the Gartner report, Roy describes two types of real-time systems:
“Engineering real time is most relevant when dealing with machines and fully automated applications that require a precise sequence and timing of interactions among multiple components.”
“Business real time is about situation awareness; sensing and responding to what is happening in the world now, rather than to what happened a few hours or days ago, or what is predicted to happen based on historical data.”
Technology and Design Patterns for Real-Time
A real-time system must process data sets very quickly. As data grows, there are several technologies and techniques to consider, including in-memory databases, parallel processing, efficient algorithms, and innovative data architectures.
Match the Speed of Analytics to the Speed of the Business Decision
To ensure the proper investment return of a real-time system, the response of the analytics must align with the speed of the decision. Two questions determine the proper speed for your system.
How quickly will the value of the decision degrade?
How much better will a decision be if more time is spent?
Automate Decisions if Algorithms Can Represent the Entire Decision Logic
Read Post