Data-Intensive Applications Need A Modern Data Infrastructure

Gone are the days when applications were installed locally, had a handful of users at any one time, and only focused on basic data entry and retrieval. Modern applications live in the cloud and access and generate large amounts of data. This data needs to be aggregated, summarized, and processed - and presented to the user in a way that is understandable, interactive, and served up in real-time. From the user perspective, a positive experience depends on data being highly available, consistent and secure - without compromising performance. To meet these needs, modern, data-intensive applications need a modern data infrastructure that can scale seamlessly as user numbers grow.

Applications, Defined

By applications, we really mean services (as in Software-as-a-Service) as nearly all new modern applications are now being built as services. With the move to microservices architecture, the boundary around what is an “application” becomes somewhat fuzzy. For purposes of this article, it is everything that goes into delivering the end-user experience. This includes the UX as well as all the backend services that make that UX possible.

Applications have existed since the beginnings of the computing era and, at their heart, allow a user to accomplish a task. For example, any smartphone has many apps, each for a specific task: the Uber app to call a car, a banking app to check accounts and transfer money, and tools like Slack, email, and Zoom to communicate with people at home or at work.

What are the components of an application? There is the UX, the interaction model for how the user makes use of the app. There is the business logic, the rules, that govern that interaction. Last, but definitely not least, there is the data. Data is the part that makes the application relevant to the user.

This article focuses on how data has changed and evolved, and why that evolution requires a modern approach to data infrastructure.

In the early days, there was only the data inputted by the user. With the advent of continuous connectivity, applications moved off our personal computers to cloud services. This has allowed larger and more varied data sets to be incorporated into the experience. The data may be used to recommend, predict, incent, or surface opportunities that derive from insights and trends.

Running as a backend service in the cloud, and with such broad data access, provides an opportunity for new capabilities as well as a host of new challenges as applications become more data-intensive.

But what does “data-intensive” mean? In physics, intensity is a measure of power per unit area, that is, energy delivered per unit area per unit time. Similarly, data intensity is measured over a set of dimensions.

DIMENSION	DEFINITION	RANGE
Size of the working data set	Size of the data queried over	Low = GBs, High > 100 TBs
Ingestion Speed	SLA on how many rows ingested/sec	Low = 1k rows/sec, High > millions of rows/sec
Query Latency	SLA on how fast the query has to run	Low = Minutes, High = Milliseconds
Query Complexity	How many joins in the query	Low = 0 joins, High > 5 joins
Concurrent Queries	Number of users or queries running concurrently	Low < 5, High > 100s

Applications that have high values in two or more of these dimensions, or medium values in several of these dimensions, are data-intensive.
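The rule above can be sketched in a few lines of Python. This is only an illustration: the numeric cut-offs are loose readings of the table (for example, "GBs" is taken to mean under 1 TB), "several" is assumed to mean three or more dimensions at medium or higher, and the names `Workload`, `rate`, `ratings`, and `is_data_intensive` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    working_set_gb: float       # size of the data queried over
    ingest_rows_per_sec: float  # ingestion SLA
    query_latency_ms: float     # latency SLA; a smaller value is more demanding
    joins_per_query: int
    concurrent_queries: int


def rate(value, low, high):
    """Rate a dimension where a bigger value means a more demanding workload."""
    if value > high:
        return "high"
    if value <= low:
        return "low"
    return "medium"


def ratings(w):
    # All thresholds are assumptions loosely derived from the table.
    return {
        "size": rate(w.working_set_gb, 1_000, 100_000),
        "ingest": rate(w.ingest_rows_per_sec, 1_000, 1_000_000),
        # Latency is inverted (tighter is harder), so compare negated values:
        # under 1 second rates "high", a minute or more rates "low".
        "latency": rate(-w.query_latency_ms, -60_000, -1_000),
        "joins": rate(w.joins_per_query, 0, 5),
        "concurrency": rate(w.concurrent_queries, 4, 100),
    }


def is_data_intensive(w, several=3):
    """High in two or more dimensions, or medium-or-higher in `several`."""
    levels = list(ratings(w).values())
    highs = levels.count("high")
    medium_or_higher = highs + levels.count("medium")
    return highs >= 2 or medium_or_higher >= several


trading_app = Workload(500, 2_000_000, 50, 1, 500)
internal_report = Workload(5, 100, 120_000, 0, 2)
print(is_data_intensive(trading_app), is_data_intensive(internal_report))
```

Changing `several` or any threshold changes the classification, which is the point: the dimensions matter more than the exact numbers.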

There are many examples of data-intensive applications made possible by this shift in data availability and how data is used.

Stock trading applications are illustrative. These operations were possible in the past only by visiting a stock brokerage or trading company in person. Today’s applications access not only a user’s account information but also a variety of information about the market and portfolios - they can even provide predictive what-if scenarios.

Digital Marketing has changed the advertising world significantly. With the ability to run many concurrent marketing campaigns, there is no end to the ideas you can test out. With access to rich demographic information, you can narrow your target segment and test specific messages and visuals. Digital marketing applications process the results and display them in a way so you can easily see what is working and what is not.

Where Data Comes From and How it is Used

To see the importance of data we need to understand where it comes from and how it is used by the application.

There are several ways data comes into existence:

User Data: Data entered by the user or on behalf of the user
3rd Party Data: Data acquired to enrich the user data. Typically loaded into the backend independently of the user application code.
Telemetry Data: Data generated about the usage of the app. This data is captured as events created by the application and stored in the same system for later use.
Aggregate Data: Aggregations over the other types of data. Aggregations can be sums, averages, or more complicated aggregate functions.
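As a rough illustration of how aggregate data is derived from telemetry data, the sketch below groups a handful of made-up events and computes a count, sum, and average per event type. The event tuples and the `aggregate_by_event` helper are hypothetical, not taken from any particular application.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical telemetry events: (user_id, event_name, duration_ms)
events = [
    ("u1", "search", 120),
    ("u1", "checkout", 340),
    ("u2", "search", 80),
    ("u2", "search", 100),
]


def aggregate_by_event(rows):
    """Group durations by event name and return count, sum, and average."""
    grouped = defaultdict(list)
    for _user, event, duration_ms in rows:
        grouped[event].append(duration_ms)
    return {
        event: {"count": len(d), "sum": sum(d), "avg": mean(d)}
        for event, d in grouped.items()
    }


summary = aggregate_by_event(events)
print(summary["search"])
```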

There are also many ways that data is used within an application:

Lookups: Lookups are about getting a small piece of data out of the system quickly. There is typically an identifier (a name, an id, email, etc…) and the information is looked up with that id. Operational Databases and NoSQL systems are pretty good at this.
Selective Queries: Selective queries are important to help you answer questions quickly and easily. This is where SQL (and relational algebra) are useful in making it easy to express your question, and where NoSQL systems often get stuck. (For more details on the limits of NoSQL read this blog post). Some examples of selective queries are:
- Who are the top salespeople in the organization?
- Who are the top players of a Fortnite game?
- Which users are experiencing a poor streaming experience?
Aggregations: Aggregations are typically done in separate analytic systems but are increasingly included directly in applications, because applications now have access to a wider set of data and users expect the results to arrive in line with the rest of the application experience. Operational Databases and most NoSQL systems (though a few specialize in just this type of query) are typically not very good at this, especially as the data size scales. Data Warehouses do this well but don’t do well powering applications. Example aggregation queries are:
- What is the average sales price over the last 12 months?
- What was my overall return on my portfolio?
Full-Text Query: Fuzzy searches, typically over unstructured data, look for approximate matches on exact words or words of similar semantic meaning. There are a few NoSQL systems that specialize in this and some of the operational databases support it as well (but usually not at scale).
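The first three access patterns can be sketched with SQL. The example below uses Python's built-in `sqlite3` module purely for convenience; the `sales` table, its columns, and the sample rows are invented for illustration, and full-text query is left out because it depends on engine-specific features.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, rep TEXT, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (id, rep, region, amount) VALUES (?, ?, ?, ?)",
    [
        (1, "Ana", "West", 1200.0),
        (2, "Ben", "East", 800.0),
        (3, "Ana", "East", 500.0),
        (4, "Cy", "West", 300.0),
    ],
)

# Lookup: fetch one small piece of data by its identifier.
lookup = conn.execute("SELECT rep, amount FROM sales WHERE id = ?", (2,)).fetchone()

# Selective query: who are the top salespeople?
top_reps = conn.execute(
    "SELECT rep, SUM(amount) AS total FROM sales "
    "GROUP BY rep ORDER BY total DESC LIMIT 2"
).fetchall()

# Aggregation: what is the average sales price?
avg_price = conn.execute("SELECT AVG(amount) FROM sales").fetchone()[0]

print(lookup, top_reps, avg_price)
```

At this size any engine handles all three; the differences described above only show up as data volume and concurrency grow.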

Two Common Use Cases for Application Data Use

Uber

To the user, Uber doesn’t look like it has much data. In the application, the only data visible is user data (such as name, phone number, favorites, and credit card). But on the backend, it actually uses all the other data types. It keeps track of all the users asking for cars (telemetry data), as well as all the drivers and their locations, current demand, and associated pricing (aggregate data). It also has maps of the roads, rules for different cities, what events are happening, current road conditions, and more (3rd party data). Users look up locations by name (full-text query) or they click on a favorite (lookup). It can also suggest locations from recent trips (selective query). The app has to process all this data in real-time to ensure a positive experience for the customer and ensure Uber and the drivers are making a profit at the same time. People think Uber is a car driving business, but it’s not. Uber is a logistics company, far more similar to UPS, FedEx, and Amazon than a taxi company. And they do it all by acquiring, aggregating, summarizing, and leveraging data.

Banking

Banking apps have evolved significantly since the early days of online banking. In the beginning, you could go to a website on your computer and see a limited number of your recent transactions from several days ago. Users checked the data infrequently (e.g. once a week). Now, everyone has an app on their phone and regularly checks account status (user data), deposits checks, and transfers money in and out of accounts. Credit card purchases are expected to update in real-time. We can also get summaries of spending, broken down by time and category, and compare different time periods (aggregations). We can search for any vendor we have paid (full-text query) or by the amount paid (lookup). The app also does analysis and points out potential duplicates, changes in spending patterns, or increases in subscriptions. It pulls in data about the rest of the market (3rd party data) and lets us compare how we are doing relative to the market or others (aggregation).

Data-Intensive Application Requirements

Building apps that make use of data in all these various ways is hard and comes with a lot of operational complexity. Following are some of the key modern application requirements that developers need to consider and conquer:

Consistency (ACID): Maintaining consistency in an application is hard. Application developers find it much easier to depend on the data infrastructure to guarantee the data is consistent, rather than have to do the checks in their application. If you are using an Operational Database you are covered, but NoSQL and DW systems typically don’t do this well.
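To illustrate what it means for the data infrastructure to guarantee consistency, here is a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper, and the balances are hypothetical; the point is only that a failed transfer leaves no partial update behind and the application does not have to undo anything itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds atomically; return True on success, False if rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # The CHECK constraint rejects an overdraft, which undoes the credit too.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)
rejected = transfer(conn, "alice", "bob", 500)
print(ok, rejected, balances(conn))
```

The credit is deliberately applied before the debit so that the rollback is visible: after the rejected transfer, neither balance has changed.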

High Availability: When offering a SaaS product, the provider is responsible for meeting its availability service level agreements (SLAs). Keeping a system running in the face of any type of failure, whether hardware, software, or an external event (e.g. a hurricane takes out a data center), is really hard. But customers have come to expect 24x7 access, no matter what happens.
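An availability SLA is usually stated as a percentage, and it helps to translate it into the downtime it allows. The sketch below does that arithmetic; the percentages are common illustrative targets rather than figures from this article, and `downtime_budget_minutes` is a hypothetical helper.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def downtime_budget_minutes(availability_pct):
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% allows about {downtime_budget_minutes(target):.1f} minutes/year")
```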

Dealing with the volume of data and all the different data types: Dealing with so many types of data is another big challenge - maybe one of the most significant. The considerations are many and complex:

How do you ingest data in a way that meets your requirements?
How do you transform it into a shape usable by your application?
How do you guarantee the quality and consistency of the data?
Which formats are the data in (JSON, CSV, Avro, Parquet, etc…) and how do you parse it?
How much data is coming in and how is that rate growing over time and what is it a function of?
What is the SLA on how fast data has to come in and be available?
What form do you need to store the data (relational, semi-structured, spatial, time-series, full text)?

Solving all these requirements is beyond challenging. Even once you have a model that is working, growth in your business often causes bottlenecks and missed SLAs when the data infrastructure can’t handle the load. This is not a problem you want to have. You want a data infrastructure that supports the data sources, formats, and ingest performance you require and one that will grow as your application usage grows.
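The format and transformation questions listed above can be made concrete with a small sketch: two feeds, one newline-delimited JSON and one CSV, are parsed and coerced into a single shape. The field names (`user_id`, `amount`) and the helper functions are assumptions made for illustration; only Python's standard `json` and `csv` modules are used.

```python
import csv
import io
import json


def from_json_lines(text):
    """Parse newline-delimited JSON into dict records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def from_csv(text):
    """Parse CSV with a header row into dict records (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def normalize(record):
    """Coerce a raw record into the shape the application expects."""
    return {"user_id": str(record["user_id"]), "amount": float(record["amount"])}


json_feed = '{"user_id": 1, "amount": 9.5}\n{"user_id": 2, "amount": 3}\n'
csv_feed = "user_id,amount\n3,4.25\n"

records = [normalize(r) for r in from_json_lines(json_feed) + from_csv(csv_feed)]
print(records)
```

Every additional format or source needs its own parser and its own type coercion, which is one reason native format support in the data infrastructure reduces application code.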

Scaling with the data as you grow: Growth can happen in several dimensions (see this blog for more on scaling). You can have growth in the number of users, the amount of data per user, the rate of data ingestion, or the number of queries per user. Additionally, growth in the different dimensions is not mutually exclusive. They often build on one another. To handle this growth your data infrastructure should be a distributed system whose compute and storage resources can be easily scaled (preferably in an online way) to handle the growth.
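A quick back-of-the-envelope calculation shows how growth in two dimensions compounds. The numbers and the `daily_queries` helper below are hypothetical.

```python
def daily_queries(users, queries_per_user):
    """Total query load is the product of two independent growth dimensions."""
    return users * queries_per_user


baseline = daily_queries(users=10_000, queries_per_user=20)
grown = daily_queries(users=20_000, queries_per_user=40)
growth_factor = grown / baseline

print(baseline, grown, growth_factor)
```

Doubling the user count and doubling the queries per user quadruples the total load, which is why capacity planning along a single dimension tends to underestimate demand.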

Security, Privacy, and Data Ownership: When you run a SaaS service you take responsibility for your customers’ data. This brings with it security challenges and privacy issues. It also brings new possibilities. You can make use of that data or directly monetize it. Navigating these choices is tricky and the data infrastructure must have the right capabilities to handle all the associated security and privacy requirements.

Semantic understanding of data: As data moves through the different systems and is transformed and aggregated it can be hard to track the semantic meaning of the data. This causes challenges for users of the data downstream. If the data infrastructure understands the schema and semantics of your data, it becomes discoverable through APIs and is tracked when it changes. This makes it much easier to manage as things evolve over time.

Timeliness: The need for real-time and instant access changes what data we can expect and when we expect it. For example, when we swipe a credit card in a grocery store, the transaction should immediately appear in the banking app. When a flight is delayed, a notification should instantly appear on the user’s mobile device. If the application can’t deliver this information in real-time, the impact on user experience, application adoption, customer satisfaction, and ultimately, business revenue and success, is enormous.

Machine Learning (ML): ML is one of the exciting breakthroughs that resulted from access to large amounts of information and large amounts of computational power. It is all about figuring out things that would have been impossible to do by hand. Fraud detection and personalization are just a few examples of the many ways to utilize ML. There are a lot of tools for identifying and training models. But operationalizing those models is challenging as most of the toolsets fall short on running the models.

The Modern Approach to Data Infrastructure

Modern applications are data-intensive because they make use of a breadth of data in more intricate ways than anything we have seen before. They combine data about you, about your environment, about your usage, and use that to predict what you need to know. They can even take action on your behalf. This is made possible because of the data made available to the app, and data infrastructure that can process the data fast enough to make use of it. Analytics that used to be done in separate applications (like Excel or Tableau) are getting embedded into the application itself. This means less work for the user to discover the key insight or no work as the insight is identified by the application and simply presented to the user. This makes it easier for the user to act on the data as they go about accomplishing their tasks.

To deliver this kind of application you might think you need an array of specialized data storage systems, ones that specialize in different kinds of data. But data infrastructure sprawl brings with it a host of problems. See this blog for a walkthrough of the problems that pattern causes.

What’s required is a database that is consistent, highly available, durable, and resilient.

It needs to scale with you as your usage grows without forcing a costly re-architecture at every stage.
It should allow you to secure your data and meet all your privacy requirements.
It needs to handle the loading of data fast enough to meet your SLAs.
It should support natively loading the data from all standard formats available.
It must be capable of delivering all the analytics your application needs with no noticeable lag, no matter how busy the system is.
It shouldn’t force you to make copies of the data and move them around to various systems.

This is, admittedly, a tall order. But SingleStore, our fast unified database for data-intensive applications on any data, anywhere, is changing the way application developers power their modern applications.

If you want to learn more about database systems that are capable of handling these requirements, check out our free trial. We’d love to power up your next data-intensive application.