
Data Intensity · 8 min Read
Partner Repost: Using Streaming Analytics to Identify and Visualise Fraudulent ATM Transactions in Real-Time
Storing and analysing larger and larger amounts of data is no longer what drives the decision-making process at successful companies. In…
Read Post

Data Intensity · 5 min Read
Webinar: Building an Analytics App with SingleStore
In this webinar, SingleStore Product Manager Jacky Liang took a live audience through the process of building an analytics app in just a few…
Read Post

Data Intensity · 3 min Read
How to Build Real-Time Dashboards at Scale
With a real-time dashboard, you can stop problems before they start and seize opportunities that your competition is not aware of yet…
Read Post

Data Intensity · 1 min Read
Free Download: Designing Data-Intensive Applications
Designing Data-Intensive Applications, as a complete book, is more than 500 pages long. It takes as its premise that data is at the center…
Read Post

Data Intensity · 4 min Read
Selecting the Right Database for Your Scale Demands
Scaling distributed systems is hard. Scaling a distributed database is really hard. Databases are particularly hard to scale because there…
Read Post

Case Studies · 7 min Read
Epigen Powers Facial Recognition in the Cloud with SingleStore – Case Study
Epigen Technology depends heavily on SingleStore as part of its core toolkit. “Without SingleStore, I can’t do what I do,” says Terry Rice…
Read Post

Data Intensity · 4 min Read
Webinar: Data Lake Advances in the Age of Operational ML/AI
Bill Vorhies of Data Science Central and Rick Negrin of SingleStore recently delivered a webinar on data lakes and the use of Hadoop in the…
Read Post

Data Intensity · 2 min Read
Webinar: How Kafka and Modern Databases Benefit Apps and Analytics
Apache Kafka is widely used to transmit and store messages between applications. Kafka is fast and scalable, like SingleStore. SingleStore…
Read Post

Data Intensity · 3 min Read
Diversifying Your Enterprise Tech Stack
Companies are living organisms. As they mature they tend to slow down a bit and align to familiar routines. This pattern has the benefit of…
Read Post

Data Intensity · 1 min Read
Visual Takeaways from Gartner Data and Analytics 2018
We attended the Gartner Data and Analytics Summit in Grapevine, Texas in early March. This series is part of its global events schedule and…
Read Post

Data Intensity · 4 min Read
Go Beyond Legacy Data with Change Data Capture, SingleStore, and Real-Time Applications
Data is driving innovative customer experiences, operations optimization, and new revenue streams. Data infrastructure teams are being asked…
Read Post

Data Intensity · 4 min Read
Machine Learning and SingleStore
What is Machine Learning? Machine learning (ML) is a method of analyzing data using an analytical model that is built automatically, or…
Read Post

Data Intensity · 6 min Read
Using SingleStore within the AWS Ecosystem
The database market is large and filled with many solutions. In this post, we will take a look at what is happening within AWS, the overall…
Read Post

Data Intensity · 3 min Read
Data Warehouses and the Flying Car Dilemma
Traditional data warehouses and databases were built for workloads that manifested 20 years ago. They are sufficient for what they were…
Read Post

Data Intensity
How Database Convergence Impacts the Coming Decades of Data Management
Within the database industry, there are often paradigm shifts in the market that create opportunities for new types of technology to emerge. When those shifts happen, new technology requirements come up and old technologies cannot satisfy them. There have been many shifts in the database market, including NoSQL, big data, in-memory computing, the Internet of Things, cloud computing, and many others.
Recently, SingleStore CEO and co-founder Nikita Shamgunov presented to a group of database professionals at the New York City Database Month meetup about these shifts and, specifically, how database convergence will impact the coming decades of data management. He discussed the latest innovations in data management and how a distributed, scalable, converged database can optimize transactional and analytical workloads. He also showed the technical impact of real-time data pipelines and scalable SQL in a distributed computing platform designed to address challenging financial application scenarios.
Watch the presentation to see a demo of large-scale data processing (several billion rows) computing Volume Weighted Average Price (VWAP), along with real-time image recognition.
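For reference, VWAP is total traded value divided by total traded volume over some window. Here is a minimal sketch of that calculation in SQL, assuming a hypothetical trades table with symbol, ts, price, and shares columns; this is illustrative only, not the schema from the demo:
-- VWAP per symbol over the trailing day:
-- sum of (price * shares) divided by sum of shares.
SELECT symbol,
       SUM(price * shares) / SUM(shares) AS vwap
FROM trades
WHERE ts >= NOW() - INTERVAL 1 DAY
GROUP BY symbol;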
Read Post

Data Intensity
Why You Need a Real-Time Data Warehouse for IoT Applications
As always-on devices and sensors proliferate, the data emitted from these devices provides meaningful insights to improve customer experiences, optimize costs, and identify new revenue opportunities. In a recent report from McKinsey & Company, Taking the Pulse of Enterprise IoT, 48 percent of respondents cited “managing data” as a critical capability gap related to their IoT initiatives.
The data infrastructure behind IoT applications requires a high performing and easy-to-access platform to support immediate responses to changing conditions. At the center of an IoT data infrastructure platform there needs to be a database that supports stream data ingestion with familiar SQL query access on a scalable, highly available platform.
SingleStore powers a number of IoT applications to help manage operating costs while improving the customer experience. These applications require core database capabilities including:
Streaming data ingestion and storage
The database must collect and store multiple streams of data in relational formats to support real-time and historical analysis. Ingestion often requires inserts, updates, and deletes to ensure data accuracy (see the pipeline sketch after this list).
Fast query response
Perform instant queries across millions of events or devices to discover real-time anomalies, or predict events from historical data, using memory-optimized SQL.
Proven compatibility
Leverage the familiarity of ANSI SQL with full data persistence to drive sophisticated analytics while seamlessly working with existing business intelligence and middleware tools.
Scalability and availability
Utilize a modern shared-nothing architecture to scale out on industry-standard hardware. Built-in resilience keeps the database online across cloud or on-premises deployments.
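As a minimal sketch of the streaming ingestion capability referenced above, a SingleStore pipeline can continuously load a Kafka topic into a table. The host, topic, table, and column names below are hypothetical:
-- Hypothetical target table for device telemetry.
CREATE TABLE sensor_readings (
  device_id BIGINT,
  reading_ts DATETIME,
  temperature DOUBLE,
  KEY (device_id) USING CLUSTERED COLUMNSTORE
);
-- Continuously ingest CSV messages from the 'sensor-events' topic.
CREATE PIPELINE sensor_events
AS LOAD DATA KAFKA 'kafka-host:9092/sensor-events'
INTO TABLE sensor_readings
FIELDS TERMINATED BY ',';
START PIPELINE sensor_events;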
SingleStore has been able to help industry-leading organizations such as Uber, Verizon, Comcast, and Cisco deliver IoT applications powered by analytics at scale. These applications include:
Real-time monitoring and detection, which can be used to manage networks and devices with instant insight into live conditions, improving customer experience while mitigating costs.
Predictive maintenance applications, which can identify potential issues before they arise to prevent outages or improve asset management for oil pumps, wind farms, vehicles, and more.
Fleet optimization, which can help control costs by identifying the location and condition of every truck or car to streamline delivery and improve customer satisfaction.
Learn more about the real-time analytic solutions for IoT applications by downloading our new IoT Analytic Solution Guide.
Read Post

Data Intensity
Designing for a Database: What’s Beyond the Query?
Even the most technically minded companies need to think about design, and working on a database product at a startup is no different. It comes with challenges, though: figuring out how to implement a human-centered design methodology at a technical company, while also building a design process that everyone across the organization agrees with. This blog details how product design is done at SingleStore and highlights how to design enterprise products at a startup.
How do we define Product Design?
Product Design is the end-to-end process of gathering requests and ideating in order to hand off pixel-perfect products and iterations. There are usually two ways of dividing the work: breaking multiple design tasks into steps handled by different team members, or having each designer take ownership of a product or feature and design it in a full-stack way. At SingleStore, we develop each product with the latter model, which requires constant, proactive engagement with others on the team.
Because everything our users experience when interacting with our product should ideally fall within the scope of ‘Product Design’, we don’t ignore email templates, docs, the help center, the support chat widget, and so on. By thinking about the product as a holistic ecosystem, we are able to capture pain points both inside and beyond the stand-alone application.
What is the process?
Below are the steps we work through at SingleStore before our customers see any end product or piece of collateral.
Kick-off
Most product features are requested by product management (PM). PM gathers feedback directly from customers or from other departments, such as customer success and sales. Designers are also encouraged to take the initiative and lead their own projects when doing so is valuable to the product and the business. To start a project, a designer writes a proposal doc and sends it to key stakeholders, who evaluate the idea and decide whether it belongs on the roadmap.
Research
Before jumping into the design phase, we do research to validate the goal, gather technical requirements, explore the market, understand our customers, and analyze what we have done so far.
Getting involved with the engineering team
First, we need to talk with the engineering team to understand the technical principles, expectations, difficulties, constraints, and context. This helps us set a clear scope of what we can, and cannot, implement in the project.
Study your customers and beyond
Customer study covers a wide range of topics, using various tools and methodologies such as interviews, surveys, and focus groups. Note that customer study is different from usability testing, which is covered in the ‘Testing’ section. While usability testing focuses on the ease of use of the product, customer study is expected to answer more strategic questions and help us evaluate the direction of the product.
Read Post

Data Intensity
Real-Time Streaming, Analytics, and Visualization with SingleStore and Zoomdata
We regularly host meetups at SingleStore headquarters as a way to share what we have been working on, connect with the community, and get in-person feedback from people like you. We invite partners and customers to join us as well. Recently, we had the pleasure of hosting a meetup with Zoomdata, where we shared two presentations on real-time streaming, analytics, and visualization using SingleStore and Zoomdata.
Read Post

Data Intensity
Delivering Scalable Self-Service Analytics
Within 48 hours of launching Google Analytics as a free product, virtually all of Google’s servers crashed. Eric Schmidt called this Google’s “most successful disaster.” Why would a free product, whose hardware requirements melted down a datacenter, be worth it? Wesley Chan, the creator of Google Analytics, later said that, “Google Analytics generates about three billion dollars in extra revenue,” as noted in Steven Levy’s book, In The Plex. Google Analytics allowed Google’s customers to measure how good AdWords actually were, and showed them, with their own data, exactly how high they could increase bids and still make money. As Chan said, “know more, spend more.”
The full potential of such an offering comes when customers are allowed to arbitrarily segment and calculate flexible aggregates. To do that, they need to be able to query raw unaggregated data. This way your company does not have to guess what the customer wants to aggregate, as all choices remain available. If raw data access is not provided, then data must be precomputed, at least on some dimensions, which limits flexibility and the extent of the insights users can get from data.
Cost and technology constraints have led most companies to build customer-facing analytics with this precompute approach, because they need to serve analytics to many customers concurrently. The scale required to offer raw data access remained untenable: it was unthinkable to run computations over billions of raw data points per request, concurrently, for thousands of customers.
Today, SingleStore is changing that conventional wisdom and offering companies the ability to serve raw unaggregated data performance to a range of customers.
To explain this capability further, there are three major pieces of technology in this use case:
Scale-out
Columnstore query execution
Efficient data isolation
Scale-Out
Knowing system performance characteristics on a per-core basis, users can calculate how much compute and storage is needed to serve analytics at scale. Once that calculation is done, the key is to use a distributed system that provides enough dedicated compute power to meet demand. SingleStore can run on anywhere from one to hundreds of nodes, which lets users scale performance appropriately.
For example, if you have a million customers with one million data points each, you have one trillion data points. Imagine that at the peak, one thousand of those customers are looking at the dashboard simultaneously – essentially firing off one thousand concurrent queries against the database. Columnstore compression can store this trillion rows on a relatively small SingleStore cluster of approximately 20 nodes. Conservatively, SingleStore can scan 100 million rows per second per core, which means that just one core can service 100 concurrent queries scanning one million rows each and deliver sub-second results for analytical queries over raw data. Below, we provide a benchmark of columnstore query execution performance.
Columnstore Query Execution
A simple query over a columnstore table, such as a `GROUP BY`, can run at a rate of hundreds of millions to over a billion data points per second per core.
To demonstrate this, we loaded a public dataset covering every airline flight in the United States from 1987 to 2015. As the goal was to understand performance per core, we loaded it into a single-node SingleStore cluster running on a 4-core, 8-thread Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz.
To repeat this experiment, download the data using the following bash script:
mkdir csv
cd csv
for s in `seq 1987 2015`
do
for m in `seq 1 12`
do
# Each zip holds one month of BTS on-time performance data.
wget http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_${s}_${m}.zip
done
done
# Quote the glob so unzip, not the shell, expands it.
unzip '*.zip'
Create this table:
CREATE TABLE ontime (
Year INT,
Quarter INT,
Month INT,
DayofMonth INT,
DayOfWeek INT,
FlightDate Date,
UniqueCarrier Varchar(100),
AirlineID INT,
Carrier Varchar(100),
TailNum Varchar(100),
FlightNum Varchar(100),
OriginAirportID INT,
OriginAirportSeqID INT,
OriginCityMarketID INT,
Origin Varchar(100),
OriginCityName Varchar(100),
OriginState Varchar(100),
OriginStateFips Varchar(100),
OriginStateName Varchar(100),
OriginWac INT,
DestAirportID INT,
DestAirportSeqID INT,
DestCityMarketID INT,
Dest Varchar(100),
DestCityName Varchar(100),
DestState Varchar(100),
DestStateFips Varchar(100),
DestStateName Varchar(100),
DestWac INT,
CRSDepTime INT,
DepTime INT,
DepDelay INT,
DepDelayMinutes INT,
DepDel15 INT,
DepartureDelayGroups Varchar(100),
DepTimeBlk Varchar(100),
TaxiOut INT,
WheelsOff INT,
WheelsOn INT,
TaxiIn INT,
CRSArrTime INT,
ArrTime INT,
ArrDelay INT,
ArrDelayMinutes INT,
ArrDel15 INT,
ArrivalDelayGroups INT,
ArrTimeBlk Varchar(100),
Cancelled INT,
CancellationCode Varchar(100),
Diverted INT,
CRSElapsedTime INT,
ActualElapsedTime INT,
AirTime INT,
Flights INT,
Distance INT,
DistanceGroup INT,
CarrierDelay INT,
WeatherDelay INT,
NASDelay INT,
SecurityDelay INT,
LateAircraftDelay INT,
FirstDepTime Varchar(100),
TotalAddGTime Varchar(100),
LongestAddGTime Varchar(100),
DivAirportLandings Varchar(100),
DivReachedDest Varchar(100),
DivActualElapsedTime Varchar(100),
DivArrDelay Varchar(100),
DivDistance Varchar(100),
Div1Airport Varchar(100),
Div1AirportID INT,
Div1AirportSeqID INT,
Div1WheelsOn Varchar(100),
Div1TotalGTime Varchar(100),
Div1LongestGTime Varchar(100),
Div1WheelsOff Varchar(100),
Div1TailNum Varchar(100),
Div2Airport Varchar(100),
Div2AirportID INT,
Div2AirportSeqID INT,
Div2WheelsOn Varchar(100),
Div2TotalGTime Varchar(100),
Div2LongestGTime Varchar(100),
Div2WheelsOff Varchar(100),
Div2TailNum Varchar(100),
Div3Airport Varchar(100),
Div3AirportID INT,
Div3AirportSeqID INT,
Div3WheelsOn Varchar(100),
Div3TotalGTime Varchar(100),
Div3LongestGTime Varchar(100),
Div3WheelsOff Varchar(100),
Div3TailNum Varchar(100),
Div4Airport Varchar(100),
Div4AirportID INT,
Div4AirportSeqID INT,
Div4WheelsOn Varchar(100),
Div4TotalGTime Varchar(100),
Div4LongestGTime Varchar(100),
Div4WheelsOff Varchar(100),
Div4TailNum Varchar(100),
Div5Airport Varchar(100),
Div5AirportID INT,
Div5AirportSeqID INT,
Div5WheelsOn Varchar(100),
Div5TotalGTime Varchar(100),
Div5LongestGTime Varchar(100),
Div5WheelsOff Varchar(100),
Div5TailNum Varchar(100),
key (AirlineID) using clustered columnstore
);
Then load data into the table:
load data infile '/home/memsql/csv/*' into table ontime fields terminated by ',' enclosed by '"' lines terminated by ',\n' ignore 1 lines;
Once the data is loaded, run a simple GROUP BY query. The following query performs a full table scan:
SELECT OriginCityName, count(*) AS flights
FROM ontime GROUP BY OriginCityName ORDER BY flights DESC LIMIT 20;
On a machine with 4 cores, this query over the 164 million row dataset runs in 0.04 seconds, which works out to roughly 1 billion rows per second per core. No, that’s not a typo. That’s a billion rows per second per core. More complex queries consume more CPU cycles, but with this level of baseline performance there is plenty of room across a cluster of 8, 16, or even hundreds of machines to handle multi-billion row datasets with response times under a quarter of a second. At that speed, queries appear instantaneous to users, leading to great user satisfaction.
Try this example using SingleStoreDB Self-Managed 6. New vectorized query execution techniques in SingleStoreDB Self-Managed 6, using SIMD and operations directly on encoded (compressed) data, make this speed possible.
Efficient Data Isolation Per Customer
Data warehouses such as Redshift and BigQuery support large scale, but may not sufficiently isolate different queries in highly concurrent workloads. On top of that, both carry substantial fixed overhead on a per-query basis. Redshift in particular does not support many concurrent queries: http://docs.aws.amazon.com/redshift/latest/dg/cm-c-defining-query-queues.html.
Depending on the analytical requirements, SingleStore allows for an ordered and partitioned physical data layout that ensures only data belonging to a single customer is scanned. In our example, the columnstore was clustered on AirlineID.
SingleStore supports clustered columnstore keys that allow global sorting of columnstore tables. In this case, a query with a predicate on AirlineID scans only the subset of data belonging to that airline. This allows SingleStore to deliver very high concurrency (thousands of concurrent queries), with each query scanning and aggregating millions of data points.
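As a minimal sketch of this behavior, here is the benchmark query narrowed to a single airline; the AirlineID value below is arbitrary and purely illustrative:
-- Because the table is clustered on AirlineID, only the matching
-- segments are scanned rather than the full 164 million rows.
SELECT OriginCityName, count(*) AS flights
FROM ontime
WHERE AirlineID = 19805
GROUP BY OriginCityName ORDER BY flights DESC LIMIT 20;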
More on Query Execution
At SingleStore, we are continuously innovating with new query processing capabilities. This is a list of recent innovations in our shipping product: https://archived.docs.singlestore.com/v6.0/release-notes/memsql/60-release-notes/.
Bringing it All Together
Going back to our original example: though the full dataset is one trillion rows, the clustered columnstore key means each customer only needs to scan through one million rows. For a simple query like the one above, scanning 500 million rows per second per core means that a single CPU core could support 500 concurrent queries and still deliver sub-second performance.
To recreate the work mentioned in this blog, try out SingleStoreDB Self-Managed 6: singlestore.com/free.
Read Post

Data Intensity
Key Considerations for a Cloud Data Warehouse
Data growth and diversity have put new pressures on traditional data warehouses, resulting in a slew of new technology evaluations. The data warehouse landscape offers a variety of options, including popular cloud solutions with pay-as-you-go pricing in a package that is easy to use and scale. Here are some considerations to help you select the best cloud data warehouse.
First, Identify Your Use Case
A cloud data warehouse supports numerous use cases for a variety of business needs. Here are some common use cases along with the notable capabilities required for each.
Ad Hoc Analysis
Ad hoc analysis provides guided or open queries to the data warehouse, giving the end user flexibility to explore deeper questions. Users work in native SQL or in an interactive visual analysis tool such as Tableau or Looker. Each query result often prompts the user to dive further into the data, going from summary or aggregate views down to distinct row-level detail. A data warehouse that is good at ad hoc analysis delivers fast, consistent responses across a variety of query types.
How does a data warehouse support ad hoc analysis?
Efficient query processing that can scan, join, and aggregate data in a variety of table structures.
Columnstore table format for optimized disk usage and accelerated aggregate query response.
Relational data format with ANSI SQL query syntax, providing a familiar, easy-to-use structured language.
Built-in statistical functions such as MAX, MIN, SUM, COUNT, STD, NTILE, and RANK, to name a few, which make it easier to build sophisticated queries (see the sketch after this list).
Data security that shields users from sensitive or unauthorized data, requiring user authentication, role-based access control, and row-level security.
Scalable concurrency for supporting thousands of users running a variety of queries simultaneously.
Native connectivity to leading business intelligence tools for easier visual analysis and collaborative dashboards.
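As a hedged illustration of those built-in functions, here is what an ad hoc query combining aggregates and window functions might look like, reusing the ontime flights table from the benchmark post above purely for illustration:
-- Rank origin cities by flight volume and bucket them into quartiles.
SELECT OriginCityName,
       count(*) AS flights,
       RANK() OVER (ORDER BY count(*) DESC) AS volume_rank,
       NTILE(4) OVER (ORDER BY count(*) DESC) AS volume_quartile
FROM ontime
GROUP BY OriginCityName;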
Machine Learning and Data Science
Data science and machine learning use a data warehouse to identify trends, discover hidden data relationships, and predict future events with sophisticated algorithms. Machine learning is a technique that can learn and improve insight discovery without being explicitly programmed to do so. Data scientists often require large volumes of data to improve their predictions and correlations. Data is often enriched and cleaned, or packaged into sample data sets for faster experimentation. Experiments are commonly performed offline due to the intense processing power required for the analysis. Advances in algorithms, hardware, and machine learning and artificial intelligence tooling have led to more advanced data processing methods that can automatically identify hard-to-find events with relatively little human coordination.
How does a data warehouse support machine learning and data science?
Support a variety of data types, including relational, CSV, JSON, and geospatial formats.
Provide native interoperability with data preparation and statistical tooling, such as Spark, SparkML, Python, R, SAS, and TensorFlow.
To maximize resource savings, offer rapid sandbox configuration for quick experimentation, with easy spin-up and termination of databases as load requirements change.
To support collaboration and sharing of analyses, offer native connectivity with modern business intelligence tools such as Tableau, Zoomdata, and Looker.
Real-Time and Operational Analytics
Operational analytics applications often manage Key Performance Indicators (KPIs) by querying data continuously. The insights might be used several times a day by people or machines. The speed of response for an operational or real-time analytics solution can vary based on the systems in place and the organizational readiness. Gartner’s Roy Schulte said it best in his report, How to Move Analytics to Real Time:
“Business real time is about situation awareness; sensing and responding to what is happening in the world now, rather than to what happened a few hours or days ago, or what is predicted to happen based on historical data.”
How does a data warehouse support real-time analytics?
Streaming ingestion of data that can be immediately queried.
Fast processing of repeat queries, potentially by thousands of users or applications.
To reduce outages and maintain 24/7 operational support, high availability that includes redundancy and auto-failover.
To improve accuracy and decision speeds, exactly-once semantics for real-time data de-duplication and enrichment.
Mixed Workload Analytics
Most organizations want a single source of data to improve decision accuracy and support a variety of workloads across ad hoc, machine learning, and real-time analytics. These expanded use cases place a strong emphasis on performance, security, and user or application concurrency. Due to the variety of applications requiring sub-second data access, mixed workloads can be a challenge to tune and govern.
How does a data warehouse support mixed workload analytics?
A robust, efficient, distributed query processor that can support a broad range of queries without overpaying for extra hardware resources or requiring hard-to-manage database configurations.
A rapid, easy-to-scale architecture that can address changes in workload complexity and user concurrency load.
Comprehensive security that shields users from sensitive data without requiring custom database schemas or views.
Broad data ingestion to support real-time streaming and batch load requirements.
Next Up, Understanding Cloud Data Warehouse Capabilities
As you evaluate your next cloud data warehouse investment, it’s important to know which capabilities matter most for your project or business. Below is a list of capabilities, organized by category, to help you identify the right data warehouse:
Usability
Rapid provisioning: Setup should be self-service and take only a few minutes from sign-up to a running, functioning database
Accessibility: For easy query processing and integration with existing applications, tools, and skills, the environment should support relational data using ANSI SQL
Easy data loading: A guided or integrated data loading process should give users an easy, integrated way to deploy a real-time data pipeline or bulk load ingestion
Optimized query processing: The database should have a distributed query optimizer that can process most queries with minimal specialized tuning
Simplified capacity management: As data or user growth expands, the data warehouse should provide managed or automated capacity adjustment to quickly address changing workloads
Performance
Ingest to analysis: Streaming data ingestion with simultaneous query processing ensures the fastest possible insights on live and historical data
Fast queries: Sub-second query response against billions of rows, with vectorized query processing and columnstore structure for ad hoc dashboards or operational reports
Operationally tuned: Compiled SQL queries accelerate query execution for added performance gains
Cost
On-demand pricing: Sometimes a data warehouse is not required for 24/7 operation; hourly billing can tightly associate usage with payment
Annual discounts: Reserved pricing discounts should be an option for operational deployments that are always available
Flexibility
Multicloud: To maximize the proximity of your data and get the ultimate performance for your applications, you need the freedom to choose the cloud service provider you prefer or have standardized on
Hybrid cloud: Maintain existing investments by spanning data warehouse deployments across on-premises and cloud on a single platform
Elastic: Driven by growth in data, users, or query sophistication, rapidly scale out or down for new capacity requirements
Interoperable: To ensure compatibility with existing tools, applications, and skills, support JDBC/ODBC connectivity, the MySQL wire protocol, and ANSI SQL
Scalability
Concurrency support: Scale-out distributed architecture ensures that high-volume ingest and write queries do not degrade dashboard or report performance
High availability: Efficient replication and distributed architecture ensure no single point of failure for operational requirements
Durable: All data should reside on disk for audit or regulatory requirements, along with expedited recovery from unexpected failures
Security
Comprehensive: Data should be secured across the analysis lifecycle, spanning single sign-on (SSO) authentication, role-based access control (RBAC), SSL encryption of data over the wire, encryption of data at rest, granular audit logging, and separation of concerns for database administrators
Consistent: Ensure a consistent security model from on-premises to the cloud, with strong security capabilities across deployments
Conclusion: Considerations for SingleStoreDB Cloud
SingleStoreDB Cloud offers all the capabilities described above as a full-featured cloud data warehouse that is easy to set up and use, supporting a mix of workloads in a single integrated platform. The product delivers a fast, flexible, and secure environment capable of analyzing both live and historical data. The pay-as-you-go service gives organizations an affordable approach to real-time analytics. Try SingleStoreDB Cloud today and get a $300 free credit offer, good for up to 300 hours of free usage.
Try SingleStoreDB Cloud Now
Read Post

Data Intensity · 4 min Read
AI’s Secret Weapon: The Data Corpus
Modern businesses are geared more toward generating customer value, making decisions and predictions using highly demanding technologies…
Read Post

Data Intensity
Seeking a Rescue from a Traditional RDBMS
In the Beginning
Years ago, organizations used transactional databases to run analytics. Database administrators struggled to set up and maintain OLAP cubes or tune report queries, and monthly reporting cycles would slow or impact application performance because all the data lived in one system. The introduction of custom-hardware, appliance-based solutions helped mitigate these issues; the resulting products were transactional databases with fast columnstore engines. From these changes, several data warehouse solutions sprang up from Oracle, IBM Netezza, Microsoft, SAP, Teradata, and HP Vertica, but they were designed for the requirements of 20 years ago. Thus new challenges arose, including:
Ease of use – each environment required specialty services to set up, configure, and tune
Expense – initial investments were high, and growth demanded additional capacity
Scalability – performance was designed around single-box configurations; the larger the box, the faster the data warehouse
Batch ingestion – an inability to store and analyze streaming data in real time
As new data or user requests landed on the system, database administrators (DBAs) had to scale the system up from a hardware perspective. Need more scale? Buy more boxes! DBAs grew tired of buying expensive new hardware every time a query was slow or a new data source had to be ingested.
An Explosion of Data
The data warehouse appliance was passable back then, but today new workloads and data growth have strained traditional solutions to the point where many users are seeking rescue from the clutches of incumbent systems. Explosive data growth from web and mobile application interactions, customer data, machine sensors, and video telemetry, combined with cheap storage, means customers are storing “everything,” which adds further strain on traditional systems. Real-time application data, now pervasive in digital business, along with new machine- and user-generated data, puts increasing pressure on ingestion and query performance requirements. Real-world examples include the digital personalization required of retailers, customer 360 programs, real-time IoT applications, and real-time logistics applications. To address the increased strain, there has been a strategic shift to cloud and distributed systems for agility and cost optimization.
Read Post