What Are Data-Intensive Applications?

This article defines data-intensive applications in more detail, including key requirements, use cases and how you can evaluate your application's data intensity.

What Are Data-Intensive Applications?

Data, not logic, is at the forefront of application development. In the software industry, it was previously thought that an application's functionality was contained in its logic and dispersed throughout the code, but it has since become clear that application logic is largely determined by the state of business data. The state of an order, for example, determines what can and cannot happen in the business process associated with that order; only after an order has been dispatched can it be shipped. These state transitions in business data can be chained together to build a long-running transaction connected with that business object.

As a result of this shift in perspective, data has evolved from being something static that aids logic to something that defines an application's business logic. Such logic is based on both historical and current data. Today's data-intensive applications are designed to manage terabytes of data from millions of customers, and thorough analysis of data related to user behavior and business performance can be used to determine future business strategies. This article will define data-intensive applications in more detail so you can see how they benefit users.

Why Do You Need Data-Intensive Applications?

Organizations can use data-intensive applications in multiple ways. Before the cloud computing era, previous orders in an order management system could be used to anticipate the projected order volume for the coming months. The much larger volumes of data available in a modern cloud-based system can increase an organization's analytical capabilities exponentially. For example, ridesharing apps like Uber rely on real-time data to find available cabs near a user's location and calculate estimated fares. Other examples of data-intensive applications include social media platforms like Facebook and Twitter; payment service providers like PayU and PayPal; mobile banking applications; video streaming services like Netflix and Hulu; and eCommerce applications like Flipkart and eBay.

What Are the Key Requirements for Data-Intensive Applications?

There are many factors that you should consider when architecting and designing a data-intensive application. The following are the most important.

High Concurrency in Data Access

Since data-intensive applications have a large number of consumers generating huge amounts of data, you need to ensure high concurrency for data access. For example, when a user opens the Uber app to request and book a ride, the app finds available cabs near the user's location and displays the computed fares for the ride. In 2021, Uber had approximately 3.5 million drivers and completed approximately 18 million trips per day. As of 2020, Uber is present in more than 10,000 cities. If you distribute those numbers equally across the cities, there are approximately 350 drivers competing for 1,800 bookings per day in each city. This makes concurrent access to the same cabs high.
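To make the concurrency requirement concrete, here is a minimal, purely hypothetical sketch of how a ride-hailing backend might assign a cab atomically when many riders compete for the same vehicle. The table, columns and connection details are invented for illustration, and the conditional-UPDATE pattern shown is a generic technique for any MySQL-wire-compatible database rather than Uber's or SingleStore's actual implementation.

```python
# Hypothetical sketch: claiming a cab atomically under high concurrency.
# Schema and connection details are illustrative only.
import pymysql

def book_cab(conn, cab_id: int, rider_id: int) -> bool:
    """Try to claim a cab; return False if another rider won the race."""
    with conn.cursor() as cur:
        # Conditional UPDATE: only one concurrent request can flip the row
        # from 'available' to 'booked', so two riders never get the same cab.
        claimed = cur.execute(
            "UPDATE cabs SET status = 'booked', rider_id = %s "
            "WHERE cab_id = %s AND status = 'available'",
            (rider_id, cab_id),
        )
    conn.commit()
    return claimed == 1

conn = pymysql.connect(host="db.example.com", user="app", password="...",
                       database="rides")
if book_cab(conn, cab_id=42, rider_id=7):
    print("cab assigned")
else:
    print("cab already taken; try the next nearest one")
```

Whether a real system uses this pattern, row locks or a dispatch queue, the point is the same: thousands of riders and drivers hitting the same rows forces the database, not the application code, to arbitrate concurrent access.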
Fast-Changing Data Streams

Ultra-fast-changing data streams need to be managed effectively in data-intensive applications. For example, say a user is making a reservation on the IRCTC (Indian Railways) website. The user searches for a train, checks seat availability and books the seat. This is a heavily used site, though. Indian Railways carries over 24 million passengers a day, who prebook and travel on about 13,452 trains. At such high volumes, seat availability data changes almost instantly. Typically, within seconds of opening the online bookings for a train, all seats are booked. Applications with a data stream of this speed must be able to keep up with it.

Super Low Latency

Low latency is important to enable instant data access and updates. Say a customer uses their debit card at an ATM to check their balance, withdraw cash or deposit money. Their account needs to be updated immediately, and the customer should get an instant notification. HDFC Bank, one of India's largest banks, owns over 18,000 ATMs. If half of those ATMs are in use simultaneously, that is around 9,000 concurrent sessions. As of 2022, HDFC Bank has over 68 million customers. Its online banking application and ATMs need to access the same account information, which increases the concurrent users of the system. While the number of concurrent users is high, data concurrency is low, because each customer is accessing their own account.

Large Data Sets with Quick Ingestion of Data

Data-intensive applications like Uber and IRCTC are less constrained by the rate of data ingestion, but ingestion is an important consideration for streaming apps. For example, the video streaming service Netflix has 222 million subscribers in more than 190 countries. Users generally watch videos in high definition, and the data needs to be streamed continuously without gaps so that there is no buffering. According to Android Authority, Netflix's data consumption runs from 6.5 GB to 11.5 GB per hour at 4K resolution.

Fast Analytics

You also need to architect for fast real-time analytics on large data sets. Netflix, for example, supports more than 2,000 devices for streaming. Each of these devices has varied support for video quality, audio quality, resolutions and formats. Netflix needs to be able to transcode and encode the original video streams appropriately for all of these devices. Meanwhile, to offer cabs to a user, Uber needs to apply parameters that minimize wait time, reduce extra driving and improve the overall ETA. This requires Uber to apply real-time analytics on a large driver database with current location, driver behavior and other parameters.

Scaling and Analytics in Data-Intensive Applications

Vertical scaling works well for a client/server application with a few users, but a single server can only manage a certain number of concurrent users. Data-intensive applications use horizontal scaling; they are written as stateless applications, allowing for load balancing across different servers.

Some functionalities in an application aren't used as much as others. For example, actions such as updating a train's status or adding a timetable to the IRCTC site are performed less frequently than actions like booking a train seat or checking a train's status. If all of these functions were scaled evenly because they were hosted on the same cluster, the performance of less-used functions would be influenced by the performance of more-used ones.
This issue is avoided with the microservices architecture, in which functions are broken down into smaller chunks that can be deployed and scaled independently. Microservices, clustering and other scaling strategies are focused on the application's processing side. To get around vertical scaling on the data side, microservices architectures either connect every service to one shared database or give each microservice its own dedicated database.

To prevent the database from becoming a bottleneck, it needs to scale intrinsically, much as the microservices do with their regional and functional deployments. A database that can partition data in a single table across numerous co-located nodes can be accessed easily from a server in a specific location without having to manually deploy a new database for each region. Regionally distinct deployments are ineffective, though, because they discourage data consolidation. Typical systems attempt to deploy separate regional databases with background activities that combine data into a central database. This causes a significant delay in data availability for data-intensive applications, and database administration becomes more difficult. The best solution allows the database to scale horizontally by distributing data between various servers.

Analytics is another major use case for such applications. Analyzing the entire data set is preferable to doing it piecemeal. Real-time analytics requires a localized data set, while offline analytics needs the entire data set. However, real-time analytics in one region cannot benefit from the learnings of another. Uber, for example, must do real-time analysis of cab statistics in order to distribute assignments evenly to cab drivers. Otherwise, it will be perceived as biased in favor of a few cab drivers. In this use case, analytics run just on the regional database will suffice. A distributed database that works well in both regional and consolidated use cases is needed for these situations.

Conclusion

Data-intensive applications are those in which the amount of data that needs to be managed grows exponentially with the number of users. Such applications can handle increasingly complex tasks to serve users and offer deeper analysis for organizations in multiple industries. The needs of data-intensive applications, such as super low latency of data access, ultra-fast-changing data streams and high concurrency, can best be met by distributed databases. For example, SingleStoreDB offers a real-time, distributed SQL database for multi-cloud, on-premises and hybrid systems. It provides fast ingestion, nearly unlimited scalability and sub-second latency. And with a unified data engine for transactional and analytical workloads, SingleStoreDB powers fast, real-time analytics for data-intensive applications.

Wondering how data-intensive your applications are? Find out with our Data-Intensity Calculator, or try SingleStoreDB free.
How to Evaluate Your Application's Data Intensity with SingleStore's Data Intensity Assessment Calculator

Modern applications live in the cloud and access and generate large amounts of data. This data needs to be aggregated, summarized and processed — and presented to users in a way that is understandable, interactive and served up in real time. As applications evolve to meet these new requirements, they are using data more intensely than ever before. If we can break through the bottlenecks of existing technology, we start to see how we can better interact with our customers, partners and employees. One of the biggest infrastructure challenges when it comes to data is the database — building a data-intensive application requires a database that can handle the intensity.

What Is Data Intensity?

Most people have an intuitive understanding of what this means. But how do you measure intensity? In physics, intensity is power (energy over time) spread over a surface area. Data intensity is likewise made up of several dimensions, and it is high requirements along these dimensions that make an application data intensive. To help organizations better understand how data intensive their applications are (and therefore what kind of data infrastructure they need), SingleStore introduced the Data Intensity Assessment Calculator, which is derived from five key dimensions:

Query latency: the amount of time it takes a database to execute a query and return a result. Data-intensive applications often have strict SLAs for query latency.
Concurrency: data-intensive applications often need to support a large number of users or concurrent queries without sacrificing the SLA on query latency.
Query complexity: data-intensive applications must be able to handle both simple and complex queries.
Data size: modern databases must be able to effortlessly operate over large data sets.
Data ingest speed: data-intensive applications must be able to load — or ingest — data at very high rates, from thousands to millions of rows per second.

Using these five variables, the Data Intensity Assessment calculates the data intensity of your application and provides an assessment of what kind of data infrastructure is needed to deliver on the SLAs and user experience your application requires.

How Data Intensive Are Your Applications? Take this three-minute assessment to find out. Your ideal database might be just around the corner!
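As an illustration of how five dimensions like these might be combined into a single number, here is a small, purely hypothetical Python sketch. The anchors, 0-10 scale and equal weighting are invented for this example; they are not the formula behind SingleStore's actual Data Intensity Assessment Calculator.

```python
# Hypothetical data-intensity score over the five dimensions described above.
# All thresholds and weights are invented for illustration.
from dataclasses import dataclass

@dataclass
class Workload:
    query_latency_ms: float      # SLA on query latency
    concurrent_queries: int      # peak concurrent users or queries
    joins_per_query: int         # rough proxy for query complexity
    data_size_tb: float          # size of the working data set
    ingest_rows_per_sec: float   # sustained ingest rate

def dimension_score(value: float, low: float, high: float) -> float:
    """Map a raw value onto a 0-10 scale between a 'low' and a 'high' anchor."""
    if value <= low:
        return 0.0
    if value >= high:
        return 10.0
    return 10.0 * (value - low) / (high - low)

def data_intensity(w: Workload) -> float:
    scores = [
        # Invert latency so that a tighter SLA scores higher.
        dimension_score(1000.0 / max(w.query_latency_ms, 1.0), 0.01, 10.0),
        dimension_score(w.concurrent_queries, 5, 500),
        dimension_score(w.joins_per_query, 0, 5),
        dimension_score(w.data_size_tb, 0.001, 100),
        dimension_score(w.ingest_rows_per_sec, 1_000, 1_000_000),
    ]
    return sum(scores) / len(scores)

# A workload with a 50 ms SLA, 200 concurrent queries, 4-way joins, 20 TB of
# data and 250K rows/sec of ingest lands mid-scale on this illustrative score,
# already high along several dimensions.
print(round(data_intensity(Workload(50, 200, 4, 20, 250_000)), 1))
```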
SingleStore Recognized in The Forrester Wave™: Translytical Data Platforms, Q4 2022

Scale Your Speed

We often meet customers struggling to derive insights from their data in real time, at any time. And while several databases are designed to support big data, few are designed to support fast data — and we mean ingesting trillions of rows per second while querying that data in milliseconds. The challenge in delivering on today's high demand for data is that most current databases weren't designed for data intensity — the idea that high volumes of data should be ingested and processed at tremendous speeds without lags in availability, no matter how complex the data or how frequently it arrives. Now, try handling both transactional and analytical workloads, and you're left with only a few databases that can stand up to these requirements.

But in markets driven by digital interactions and instant everything, simply being fast won't cut it. The world's leading brands demand speed at scale. Whether they're preventing millions of fraudulent credit card transactions, offering dynamic pricing to hundreds of thousands of riders or sending "watch next" recommendations to millions of viewers, it's not enough to be fast. You also have to go big.

Scale out comes in many forms — but for workloads that demand high performance (and even higher security), customers like Akamai, Uber and Hulu choose highly performant, on-premises platforms to augment their cloud deployments. SingleStore uses massively parallel processing (MPP) to scale out database architectures, adding nodes as workload demands increase.

To highlight scale-out performance potential, SingleStore recently completed TPC benchmarking studies with Dell Technologies, using some of their most powerful compute and storage solutions, including PowerEdge 940, PowerEdge 740 and PowerFlex. Dell's white paper, "Unleash the Power of Real-Time Data Intensive Applications," dives into how Dell Technologies platforms work seamlessly with SingleStore's MPP architecture, scaling to thousands of clusters with high availability across multiple geo locations. From emerging data challenges for today's organizations to optimizing business outcomes with SingleStore on PowerFlex storage, Dell's white paper dives deeper into:

How SingleStore accelerates data ingest, reduces query latency and increases concurrency
The next era of cloud database management systems, and how SingleStore and Dell add greater resiliency into your database architecture
How Dell and SingleStore provide a modern IT foundation for data-intensive applications

"The single biggest challenge that many companies face is the ability to handle the ever-growing data and the ability to draw insights. They often do that with error-prone, complex technical architectures. SingleStore provides a compelling solution for such data-intensive applications on all key dimensions of scale, performance, HA and TCO. We are thrilled to work with Dell PowerEdge and PowerFlex in bringing the power of our solution on top of their leading virtualization platform." — Shireesh Thota, SVP Engineering at SingleStore

Download Dell's white paper, "Unleash the Power of Real-Time Data Intensive Applications," today.

Get started with SingleStore: ultra-high performance, accelerated query ingest and elastic scalability start here. Try SingleStore free today.
Are My Applications Data Intensive?

SingleStore is a hybrid relational database geared toward data-intensive applications that runs on-prem, or in a private or public cloud. What is a data-intensive application, and why use SingleStore even if I don't think I have one? Unfortunately, the "why" becomes difficult because several database vendors lean into the idea of data intensity — and messaging around the fastest databases, best performance and more becomes even more complicated when analytics and reporting tool vendors add a data front end to their offerings. This blog post explores what a data-intensive application means, what makes SingleStore unique and why you should start using SingleStore today.

What Is Data Intensity?

When you hear the phrase "data intensive," your first thought might be, "What does data intensive mean?" Or even, "I don't think I have a data-intensive application." The truth is, a lot of your existing applications should be data intensive — but limitations of database and hardware technology have guided architectural decisions every step of the way.

Back in the '80s, I worked on a server with a 50MB drive. We had to severely limit what information we wanted to capture. When I started in data warehousing, the company I worked for specialized in successfully building multi-TB, customer-centric data warehouses. Although this was an unusual feat in 1998, what made us successful was our architectural decisions not only about what we could include, but also what we should exclude from the data warehouse. Updates were run in a weekend batch, with hopes it would be ready for Monday morning.

That logic is still being applied today! We know that data comes in real time to most of our record-keeping systems and applications, but we still use the same approach to our architecture. We limit our designs around bottlenecks. We don't bring in too much data so our load process doesn't choke. We report on stale data since our processes are batched to our analytics and reporting databases. We limit access so as not to overload the database. We archive data because drives fill up and are expensive. We create aggregate tables and cubes in batches at fixed periods of time, because reporting on large volumes of data is just too slow. We extract, transform and load data multiple times going from our source systems to operational data stores, to data warehouses, to data marts, to cubes, to data lakes — and then we flatten our data. We have created this complicated architecture to overcome speed limitations in our storage and retrieval process.

I worked for a large online retailer and assisted on a project a few years ago for hourly reporting. We ended up having a completely separate process for the hourly reporting, since our data came in nightly batch runs. Not only that, but it took 10 to 15 minutes into each hour before you could see the previous hour's data. And users could only see the results for half the company.

Even if you're able to work around these limitations, the need for data-intensive applications is in our future. Five years ago it would have been inconceivable to track a UPS truck driving through a neighborhood. Now, I watch it right on my phone. The expectation of real-time access to the things that impact our day-to-day only continues to grow.
And if we take out the limitations of technology, we start to see how we can better interact with our customers and suppliers — starting with a database that handles data-intensive applications and sets the foundation for the future.

Get the C-Suite's Guide for Data-Intensive Applications

We can start now, phasing in applications that are limited by current designs. We can augment our existing applications where data-intensive apps demand it — building new applications and modernizing what we have. This allows us to move seamlessly into the next generation that is data intensive. It also removes the need, and the associated costs, of migrating later and choosing a database, programming language and deployment environment under pressure. The other consideration for SingleStore implementations is a lower total cost of ownership (TCO) on less intensive applications — allowing budget to be redirected not only from the database but also from the infrastructure (servers), and letting you make better use of your organization's manpower.

Recognizing Data-Intensive Applications: 5 Key Criteria
7 Keys to Delivering a Seamless Streaming Experience for This Year’s Big Game

This year's Big Game will be one of TV's most viewed programs. And it's quickly becoming one of the most widely streamed media events. In 2021, the Big Game attracted 5.7 million livestream viewers. That's the highest average minute audience for any National Football League (NFL) game ever. Last year's game was also the first in NFL history with more than 1 billion streaming minutes.

But what does it take to create a seamless streaming experience for an event like the Big Game? After all, such an event requires a company to manage a massive volume of data. And an event like this is just one example of a use case with a data-intensive workload that requires data infrastructure fit to handle complex, real-time data and analytics. Without such technology, streaming companies can expect livestream outages and poor user experiences. First, let's get our arms around the playing field we're facing. Then, we'll tackle the conversation about what streaming companies can do to pull off events of this magnitude.

Streaming Is Where the Action Is

Cable TV used to be the center of the action. Now internet streaming is where it's at. Netflix became the reigning champion of streaming services as consumers shifted from cable TV to the internet. Amazon Prime has continued to expand its subscriber numbers and roster of live sports content. And Hulu is recognized for its huge variety of familiar shows from networks and a growing catalog of its own critically acclaimed original series like The Handmaid's Tale.

The success of these companies has led to an explosion of providers entering the streaming space with their own content libraries. Disney+ launched in November 2019, premiering the "Hamilton" movie the following year to reel in paying customers for the business – a winning strategy. Since then the world has also seen NBC release Peacock, HBO introduce HBO Max, Discovery Inc. launch Discovery+ and other businesses jump into the ever more crowded streaming wars.

Clearly, streaming is where content is going, and the pandemic only accelerated that movement. Time spent streaming grew 44% between the fourth quarters of 2019 and 2020, according to Conviva. And a recent Nielsen article indicated that in the last week of December 2021, audiences streamed 183 billion minutes — eclipsing the amount of time they spent streaming at the weekly height of COVID-driven lockdowns in early 2020 (166 billion minutes).

Expectations and Requirements Have Changed

Streaming has led to new consumer expectations and technology trends. Consumers now expect personalized media. At the same time, there is a big consolidation of data infrastructure technologies, because streaming media and some other workloads are data intensive and real-time analytics are needed to enable streaming providers to act in the moments that matter. As users around the world huddle to access and enjoy digital content — like the Big Game — simultaneously, it can put a lot of stress on the underlying infrastructure. During these times, the number of users on a streaming provider's systems can grow exponentially.
If the streaming company is unable to deliver the kinds of experiences that consumers and other stakeholders, like advertisers and sports leagues, expect, it can result in lost viewers and revenue. To stream events to large numbers of simultaneous viewers, companies need fast and elastic data infrastructure that is capable of delivering quality content experiences in real time.

Effective Data Management Is of Paramount Importance

Seamlessly pulling off an event like the Big Game requires modern data management tools. With the right data management, organizations can solve for data-intensive workloads by ensuring real-time (or close to it) data ingestion and low query latency, and the ability to handle complex queries and high concurrency. A modern database does that by unifying, simplifying and reducing the cost of data infrastructure — so providers can spend less and earn more.

This is the right approach for today and tomorrow. With the right data management, providers can maintain a robust and evolving understanding of audience preferences to create lucrative revenue streams from advertiser-supported streaming. This is particularly beneficial for an event like the Big Game. A modern database also enables streaming providers to prepare for the next wave of personalization – delivering streaming products directly to consumers.

We — and Our Customers — Know This From Experience

Our customers Comcast and Hulu provide great examples of the challenges that streaming providers face — and how they can address them and expand their opportunities in the process. Comcast needed a data platform to proactively diagnose potential issues and deliver the best possible video experience for viewers. With SingleStore, Comcast now has the power to get both viewership and infrastructure monitoring metrics in real time. And Comcast's streaming analytics drive proactive care and real-time recommendations for 300,000 events per second.

Before adopting SingleStore, Hulu struggled to maintain its massive data footprint. It was doing continuous manual maintenance, dedicating too much of its developers' attention to maintenance and facing high costs. As a result, the user experience suffered. The issue reached epic proportions when Hulu faced massive outages during the 2018 and 2020 Super Bowls.

Hulu knew it needed to make a change. So it set out to simplify its operations, deliver high performance at scale and decrease its data footprint. And it called on SingleStore to enable that change. After implementing our modern database, Hulu reported a more than 50% reduction in its infrastructure, a massive performance boost and decreasing costs. SingleStore also provided Hulu with the ability to create a "system of insight" so the streaming giant can identify, predict and forecast problems and move quickly to address them. Hulu relies on quality-of-service metrics — collected from various user platforms like browsers, mobile apps and streaming devices like Fire Stick and Roku — to provide user experience information across platforms and service providers.
Since switching to SingleStore, Hulu has been able to access this data in a timely manner, which has been critical to this industry-leading company's business success.

By using the right data management tools, your business can also benefit by:

Decreasing your data footprint
Delivering high performance at scale
Employing analytics to drive proactive care and real-time recommendations at scale
Identifying, predicting and addressing problems quickly as they arise
Monitoring metrics in real time
Providing user experience information across platforms and service providers, and
Simplifying your operations

If you have questions about how SingleStore helps address data-intensive application or streaming requirements, ask them on our Forum. Development experts and engineers from SingleStore, as well as members of our user community, are always happy to help out. You can also follow us on Twitter.
You're Using Too Many Databases. You Only Need One

Data infrastructure complexity is rampant in our industry, with modern applications continuing to be built on top of multiple, special-purpose databases, and with even more being added throughout a product's and organization's lifespan. Peruse GitHub and you'll typically find the following database mix for a web application:

A traditional relational database, like MySQL, MariaDB or PostgreSQL, for storing and retrieving content. Sometimes that can be a NoSQL database, such as MongoDB or Cassandra (although you really shouldn't do that).
A faster, often in-memory, key-value database, such as Redis or Memcached, for content caching or tracking the state of a high-speed background job queue.
A specialist database for a particular use case or feature, like Elastic for full-text search, or a time series database.
A data warehouse for pulling data back together and running analytics.

This is a common practice beyond self-hostable web applications, and cloud service providers seem to be incentivizing this behavior. We've already called to stop this insanity, which we dubbed "data infrastructure sprawl". Well, I'm here to tell you that you don't need to use multiple databases for your web apps. You only need one database: SingleStore.

The advantages of using multiple databases

Let's assume you start your project with a single database. There are many ways in which you can reach its limits:

You start spinning up background jobs, and all of a sudden you run out of connections, causing your team to scramble to control concurrency limits.
You need to run some complex joins to provide analytical data in your app, which your database takes too long to run, slowing down your response times and disappointing your customers. Again, your team scrambles to work around the issue.
You need to format data in a specific way, and you find yourself tinkering with app logic to circumvent the feature set your database was meant to offer. And again, your team is working around your database limitations.

Adding new databases seems like a sensible approach:

Right tool for the job: your initial database can't do all the things, so you extract part of your app to rely on another database that is good for a specific task. This solves your performance problems.
It's not that hard to set up: frameworks and libraries are widely available for a number of different databases. Your dev team can just run `docker-compose up` to spin up multiple services in development, and you can use DBaaS (Database as a Service) offerings to easily run different databases in production. Job done.
It's a well-accepted common practice: like a self-fulfilling prophecy, the more apps do this, the more apps keep doing this. So you find lots of tutorials, forum entries, documentation, libraries, etc., making it easier for you to follow this pattern.

Separation of concerns can indeed apply to the database layer. Multiple database technologies can be used with monolithic applications, and can even seem more natural in a microservices environment, where each service would have its own database. This approach, however, is not bulletproof. Far from it, actually. In the next section, we'll explore some of the common problems that arise when you choose to use multiple database technologies to work around the limitations of your main, or initial, database.

The downsides of using multiple databases

When designing and building software, you often find yourself managing trade-offs.
In the previous section, we listed some of the advantages that come with using multiple databases to back your application. It's not all fun and games, though. You'll find that the number of issues and potential problems does not grow linearly with the number of database services you add to your infrastructure; in fact, you get an exponential number of headaches:

Not just a steeper learning curve: there are now multiple learning curves. Developers aren't well versed in the intricacies of each database technology by default. When you're onboarding new team members, or when your team members need to touch a new part of the codebase, the multiplicity of database technologies can very well slow down their learning or result in technical debt, if the new code they write doesn't leverage the database properly. If you added a new database service to speed up your development velocity initially, you may find later on that you've actually only punted the problem.

More points of failure! You now need to implement disaster recovery strategies that cover multiple database technologies, each having their own way of allowing you to manage data backup and restoration. You'll also need to monitor more services, each of them having different relevant metrics. Not only that, you also need to consider how your application's features that rely on different databases coordinate with each other, so that you have a good grasp of how your users, and your team, will deal with failures in a given database, or several of them. Imagine that your app's search bar stops working because the specific database that backs the search feature is down -- how does your application behave then? Or if your background jobs are suddenly lost because the database that manages the queues ran in-memory and the server crashed -- will your app degrade gracefully, or will problems start piling up for your users and your team and result in an entangled mess? Orchestrating additional points of failure is hard and will take time and resources away from building and polishing your application.

Increased business costs. We've seen how it can be more expensive for your team to continue to grow your application, but you also need to consider the multiplication of direct business costs. Each database vendor will price their cloud offerings, self-hosted licenses and support services differently, each of course charging healthy margins on their services. As your application grows, you'll find yourself hitting different tier limits with the different vendors. You'll often find yourself over-provisioning not for one, but for several services, paying more than you need several times over.

Using multiple database technologies is a costly affair. It will take time away from your team, slowing down onboarding of new members and development of new features over time. It will add complexity to your architecture and to operations management, making it harder to maintain your quality of service. And it will cost your business more dollars as your application grows. This tradeoff between capability, performance and cost is accepted by most organizations, as there have been no real alternatives for achieving the right performance and capabilities for modern, data-intensive applications given the limitations of available database technologies. But we believe that there's a far better way to achieve your goals without taking on the problems described above.
Just use SingleStore

Over the years, the SingleStore database has been used by hundreds of companies to replace a variety of best-in-class database technologies. It turns out that for an overwhelming majority of cases, you don't need an entirely different database technology to overcome specific limitations of your first, or main, database; you'd only need that database to have a few more features, and for it to be faster across the board.

A customer in the cybersecurity space recently shared with us that they replaced hundreds of instances of two different database technologies, Elastic and Postgres, with a single cluster running a couple dozen units of SingleStore. This represents not only massive dollar savings, but tremendous simplification in maintainability, feature development and team growth. Another customer, Fathom Analytics, a privacy-first web analytics service, started out with SingleStore when their cloud-based MySQL deployment started to produce slow dashboards for their fast-growing user base. A few months in, they're already looking to replace other database technologies with SingleStore to simplify their architecture. Theirs is a voluntarily small, bootstrapped company. If you read through their technical blog posts, you'll see that their goal is to provide the best and fastest analytics service in the world, while staying small. SingleStore enables this at the database level.

We have a number of case studies available on our website describing how customers simplified their architecture by replacing at least two databases with just SingleStore:

Case Study: How Fanatics powered their way to a better future
Case Study: How GoGuardian stores and queries high throughput data with SingleStore
Case Study: Insite360 Uses SingleStore Pipelines to Deliver IoT in the Cloud

There's no end to the benefits.

Fast and familiar. SingleStore is an incredibly fast relational database that can support both transactional and analytical queries at any scale, leveraging in-memory and on-disk storage capabilities. In addition, it supports a variety of data formats, ranging from JSON to geospatial to time series data. It is MySQL-wire compatible, making it a drop-in replacement for MySQL and MariaDB: you don't need to rewrite code, retool or learn an entirely new technology!

Scalable and versatile. When you start your project with SingleStore, you'll be able to scale your usage forever: not only will the database remain fast for your main use case at all sizes, you will also be able to leverage its versatility to build new use cases without needing to deploy additional, specialized database technologies.

You'll rationalize your architecture.
Your team will always be on familiar ground.
You'll save money.
You can rest assured that your application won't find a bottleneck at the database layer.

That's why companies of all sizes, from startups to unicorns and Fortune 10 companies, have already adopted SingleStore. You can try SingleStore for free in the cloud or deploy it yourself. Give it a shot today. I promise you won't regret it. Create your account now!
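To illustrate the drop-in point above, here is a minimal sketch, assuming a placeholder SingleStore endpoint and a hypothetical `events` table, that runs a transactional write and an analytical aggregation against the same database through an ordinary MySQL driver (pymysql). It illustrates the MySQL-wire compatibility; it is not a production ingestion or reporting pattern.

```python
# Minimal sketch: one MySQL-wire-compatible database serving both the write
# path and an analytical query. Host, credentials and schema are placeholders.
import pymysql

conn = pymysql.connect(host="your-singlestore-host.example", port=3306,
                       user="app", password="...", database="webapp")

with conn.cursor() as cur:
    # Transactional write path: record a user event as it happens.
    cur.execute(
        "INSERT INTO events (user_id, event_type, created_at) "
        "VALUES (%s, %s, NOW())",
        (1234, "page_view"),
    )
    conn.commit()

    # Analytical read path on the same database: no separate warehouse,
    # cache or search cluster involved.
    cur.execute(
        "SELECT event_type, COUNT(*) AS cnt "
        "FROM events "
        "WHERE created_at >= NOW() - INTERVAL 1 DAY "
        "GROUP BY event_type ORDER BY cnt DESC LIMIT 10"
    )
    for event_type, cnt in cur.fetchall():
        print(event_type, cnt)

conn.close()
```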
Real-time Machine Learning in SingleStore with MindsDB

SingleStore and MindsDB recently announced a partnership to advance machine learning innovation and provide machine learning features within the SingleStore database. SingleStore offers a number of features that enhance training performance for machine learning, in addition to supporting real-time inference. This blog post from MindsDB VP of Business Development, Erik Bovee, outlines this exciting integration.

In previous posts we have explored machine learning at the data layer, machine learning on data streams, and handling difficult machine learning problems such as large, multivariate time series. SingleStore offers a number of features and capabilities that enhance machine learning across all of the use cases we have explored before. SingleStore is a high-performance database: originally designed to run in-memory, it boasts extremely high performance at scale and is particularly well adapted to both online analytics and transactions.

Column stats, histograms and windowing functions

Running MindsDB with SingleStore can significantly increase the performance and reduce the computational requirements of training your machine learning algorithms, which can be quite intensive depending on the size and type of data and the size and characteristics of the ML model. For many machine learning applications, training is done on a subset of data, not the entire data set. Machine learning engineers are often tasked with extracting statistically important subsets of the data, then preparing and transforming that data for training. SingleStore has two significant features that simplify these tasks and accelerate training: data sampling with automated stats, and windowing functions.

SingleStore has statistical features (you can explore them more deeply here), including a feature called 'autostats' which gathers information used for query planning and is also extremely useful for data sampling for machine learning. Autostats provides two types of information:

Column stats, including information on the cardinality of a column (high cardinality can be a challenge in ML and is something that MindsDB handles well), and
Histograms, or range statistics, which provide information on the distribution of data in a column.

Information from column stats helps MindsDB make the best automated choice of ML model or mixer to train on the data, and a histogram very quickly provides a data sample and the statistical information necessary for data preparation and transformation. With 'autostats' turned on in SingleStore, you can reduce the time MindsDB takes to prep data, and you generate statistics that also contribute to faster training times and higher-quality trained models. Another useful feature for more efficient training is the SingleStore 'Window' function.
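As a rough illustration of how a window function helps carve out a training sample, here is a hedged sketch. The `sensor_readings` table, its columns and the connection details are hypothetical, and the query is ordinary SQL window-function syntax issued through a MySQL-wire-compatible client rather than a documented MindsDB or SingleStore API.

```python
# Hypothetical sketch: evenly spaced sampling of rows for ML training using
# ROW_NUMBER() as a window function. Schema and connection are invented.
import pymysql

conn = pymysql.connect(host="db.example.com", user="ml", password="...",
                       database="metrics")

SAMPLE_EVERY_N = 100  # keep roughly 1% of rows, evenly spaced in time

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT ts, sensor_id, reading
        FROM (
            SELECT ts, sensor_id, reading,
                   ROW_NUMBER() OVER (PARTITION BY sensor_id ORDER BY ts) AS rn
            FROM sensor_readings
        ) ranked
        WHERE MOD(rn, %s) = 0
        """,
        (SAMPLE_EVERY_N,),
    )
    training_rows = cur.fetchall()

conn.close()
print(f"sampled {len(training_rows)} rows for training")
```

The idea is simply that the database, rather than the training job, does the heavy lifting of ranking and thinning the data before it ever leaves the cluster.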
Data-Intensive Applications Need A Modern Data Infrastructure

Gone are the days when applications were installed locally, had a handful of users at any one time, and only focused on basic data entry and retrieval. Modern applications live in the cloud and access and generate large amounts of data. This data needs to be aggregated, summarized, and processed - and presented to the user in a way that is understandable, interactive, and served up in real-time. From the user perspective, a positive experience depends on data being highly available, consistent and secure - without compromising performance. To meet these needs, modern, data-intensive applications need a modern data infrastructure that can scale seamlessly as user numbers grow.

Applications, Defined

By applications, we really mean services (as in Software-as-a-Service), as nearly all new modern applications are now being built as services. With the move to microservices architecture, the boundary around what is an "application" becomes somewhat fuzzy. For purposes of this article, it is everything that goes into delivering the end-user experience. This includes the UX as well as all the backend services that make that UX possible.

Applications have existed since the beginnings of the computing era and, at their heart, allow a user to accomplish a task. For example, any smartphone has many apps, all for specific tasks: the Uber app to call a car, a banking app to check accounts and transfer money, and tools like Slack, email and Zoom to communicate with people at home or at work. What are the components of an application? There is the UX, the interaction model for how the user makes use of the app. There is the business logic, the rules, that govern that interaction. Last, but definitely not least, there is the data. Data is the part that makes the application relevant to the user. This article focuses on how data has changed and evolved, and why that evolution requires a modern approach to data infrastructure.

In the early days, there was only the data inputted by the user. With the advent of continuous connectivity, applications moved off our personal computers to cloud services. This has allowed larger and more varied data sets to be incorporated into the experience. The data may be used to recommend, predict, incentivize, or surface opportunities that derive from insights and trends. Running as a backend service in the cloud, and with such broad data access, provides an opportunity for new capabilities as well as a host of new challenges as applications become more data-intensive.

But what does "data-intensive" mean? In physics, intensity is a measure of power (energy over time) spread over a surface area. Similarly, data intensity is measured over a set of dimensions:

Size of the working data set: the size of the data queried over. Low = GBs, High > 100 TBs.
Ingestion speed: the SLA on how many rows are ingested per second. Low = 1K rows/sec, High > millions of rows/sec.
Query latency: the SLA on how fast the query has to run. Low = minutes, High = milliseconds.
Query complexity: how many joins are in the query. Low = 0 joins, High > 5 joins.
Concurrent queries: the number of users or queries running concurrently. Low = fewer than 5, High = hundreds.

Applications that have high values in two or more of these dimensions, or medium values in several of them, are data-intensive. There are many examples of data-intensive applications made possible by this shift in data availability and how data is used. Stock trading applications are illustrative.
These operations were possible in the past only by visiting a stock brokerage or trading company in person. Today's applications not only access a user's account information but also a variety of information about the market and portfolios - they can even provide predictive what-if scenarios. Digital marketing has changed the advertising world significantly. With the ability to run many concurrent marketing campaigns, there is no end to the ideas you can test out. With access to rich demographic information, you can narrow your target segment and test specific messages and visuals. Digital marketing applications process the results and display them in a way so you can easily see what is working and what is not.

Where Data Comes From and How It Is Used

To see the importance of data we need to understand where it comes from and how it is used by the application. There are several ways data comes into existence:

User Data: data entered by the user or on behalf of the user.
3rd Party Data: data acquired to enrich the manual data. Typically loaded into the backend independently of the user application code.
Telemetry Data: data generated about the usage of the app. This data is captured as events created by the application and stored in the same system for later use.
Aggregate Data: aggregations over the other types of data. Aggregations can be sums, averages, or more complicated aggregate functions.

There are also many ways that data is used within an application:

Lookups: lookups are about getting a small piece of data out of the system quickly. There is typically an identifier (a name, an id, an email, etc.) and the information is looked up with that id. Operational databases and NoSQL systems are pretty good at this.

Selective Queries: selective queries are important to help you answer questions quickly and easily. This is where SQL (and relational algebra) are useful in making it easy to express your question, and where NoSQL systems often get stuck. (For more details on the limits of NoSQL, read this blog post.) Some examples of selective queries are: Who are the top salespeople in the organization? Who are the top players of a Fortnite game? Which users are experiencing a poor streaming experience?

Aggregations: aggregations are typically done in separate analytic systems but are increasingly included directly in applications, as applications have access to a wider set of data and users expect results to be in line with the application experience. Operational databases and most NoSQL systems (though there are a few that specialize in just this type of query) are typically not very good at this, especially as the data size scales. Data warehouses do this well but don't do well powering applications. Example aggregation queries are: What is the average sales price over the last 12 months? What was my overall return on my portfolio?

Full-Text Query: fuzzy searches, typically over unstructured data, look for approximate matches on exact words or words of similar semantic meaning. There are a few NoSQL systems that specialize in this, and some of the operational databases support it as well (but usually not at scale).

Two Common Use Cases for Application Data Use

Uber

To the user, Uber doesn't look like it has much data. In the application, the only data visible is user data (such as name, phone number, favorites and credit card). But on the backend, it actually uses all the other data types.
It keeps track of all the users asking for cars (telemetry data), as well as all the drivers and their locations, current demand, and associated pricing (aggregate data). It also has maps of the roads, rules for different cities, what events are happening, current road conditions, and more (3rd party data). Users look up locations by name (full-text query) or they click on a favorite (lookup). It can also suggest locations from recent trips (selective query). The app has to process all this data in real-time to ensure a positive experience for the customer and ensure Uber and the drivers are making a profit at the same time. People think Uber is a car driving business, but it's not. Uber is a logistics company, far more similar to UPS, FedEx, and Amazon than a taxi company. And they do it all by acquiring, aggregating, summarizing, and leveraging data.

Banking

Banking apps have evolved significantly since the early days of online banking. In the beginning, you could go to a website on your computer and see a limited number of your recent transactions from several days ago. Users checked the data infrequently (i.e. once a week). Now, everyone has an app on their phone and regularly checks account status (user data), deposits checks, and transfers money in and out of accounts. Credit card purchases are expected to update in real-time. We can also get summaries of spending broken down by time and category, and do comparisons over different time periods (aggregations). We can search for any vendor we have paid (full-text search) or by the amount paid (lookup). The app also does analysis and points out potential duplicates, changes in spending patterns, or increases in subscriptions. It pulls in data about the rest of the market (3rd party data) and lets me compare how I am doing relative to the market or others (aggregation).

Data-Intensive Application Requirements

Building apps that make use of data in all these various ways is hard and comes with a lot of operational complexity. Following are some of the key modern application requirements that developers need to consider and conquer:

Consistency (ACID): maintaining consistency in an application is hard. Application developers find it much easier to depend on the data infrastructure to guarantee the data is consistent, rather than have to do the checks in their application. If you are using an operational database you are covered, but NoSQL and DW systems typically don't do this well.

High Availability: when offering a SaaS product, the provider is responsible for meeting availability service level agreements (SLAs). Keeping a system running in the face of any type of error - hardware, software, or external environment events (e.g. a hurricane takes out a data center) - is really hard. But customers have come to expect 24x7 access, no matter what happens.

Dealing with the volume of data and all the different data types: dealing with so many types of data is another big challenge - maybe one of the most significant.
The considerations are complex and considerable:

How do you ingest data in a way that meets your requirements?
How do you transform it into a shape usable by your application?
How do you guarantee the quality and consistency of the data?
Which formats are the data in (JSON, CSV, Avro, Parquet, etc.) and how do you parse it?
How much data is coming in, how is that rate growing over time, and what is it a function of?
What is the SLA on how fast data has to come in and be available?
In what form do you need to store the data (relational, semi-structured, spatial, time-series, full text)?

Solving all these requirements is beyond challenging. Even once you have a model that is working, growth in your business often causes bottlenecks and missed SLAs when the data infrastructure can't handle the load. This is not a problem you want to have. You want a data infrastructure that supports the data sources, formats, and ingest performance you require and one that will grow as your application usage grows.

Scaling with the data as you grow: growth can happen in several dimensions (see this blog for more on scaling). You can have growth in the number of users, the amount of data per user, the rate of data ingestion, or the amount of queries per user. The different dimensions of growth are not mutually exclusive; they often build on one another. To handle this growth your data infrastructure should be a distributed system whose compute and storage resources can be easily scaled (preferably in an online way) to handle the growth.

Security, Privacy, and Data Ownership: when you run a SaaS service you take responsibility for customers' data. This brings with it security challenges and privacy issues. It also brings new possibilities: you can make use of that data or directly monetize it. Navigating these choices is tricky, and the data infrastructure must have the right capabilities to handle all the associated security and privacy requirements.

Semantic understanding of data: as data moves through the different systems and is transformed and aggregated, it can be hard to track the semantic meaning of the data. This causes challenges for users of the data downstream. If the data infrastructure understands the schema and semantics of your data, the data becomes discoverable through APIs and is tracked when it changes. This makes it much easier to manage as things evolve over time.

Timeliness: the need for real-time and instant access changes what data we can expect and when we expect it. For example, when we swipe a credit card in a grocery store, the transaction should immediately appear in the banking app. When a flight is delayed, a notification should instantly appear on the user's mobile device. If the application can't deliver this information in real-time, the impact on user experience, application adoption, customer satisfaction, and ultimately, business revenue and success, is enormous.

Machine Learning (ML): ML is one of the exciting breakthroughs that resulted from access to large amounts of information and large amounts of computational power. It is all about figuring out things that would have been impossible to do by hand. Fraud detection and personalization are just a few examples of the many ways to utilize ML. There are a lot of tools for identifying and training models, but operationalizing those models is challenging, as most of the toolsets fall short on running the models.
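As one small illustration of the format question raised above, here is a hedged Python sketch that normalizes incoming JSON and CSV records into a single row shape before loading. Every field name is invented; a real pipeline would also cover formats like Avro and Parquet, schema drift, bad rows and batching to meet the ingest SLA.

```python
# Illustrative only: turning JSON lines and CSV text into one uniform row shape.
import csv
import io
import json
from datetime import datetime, timezone

def from_json_line(line: str) -> tuple:
    """Parse one JSON event into (user_id, event, timestamp)."""
    rec = json.loads(line)
    return (rec["user_id"], rec["event"], datetime.fromisoformat(rec["ts"]))

def from_csv(text: str):
    """Parse CSV rows that carry an epoch-seconds column into the same shape."""
    for row in csv.DictReader(io.StringIO(text)):
        yield (int(row["user_id"]), row["event"],
               datetime.fromtimestamp(int(row["epoch"]), tz=timezone.utc))

json_lines = ['{"user_id": 1, "event": "login", "ts": "2022-01-01T10:00:00+00:00"}']
csv_text = "user_id,event,epoch\n2,purchase,1641031200\n"

rows = [from_json_line(line) for line in json_lines] + list(from_csv(csv_text))
# rows is now a uniform list of (user_id, event, timestamp) tuples, ready for a
# bulk INSERT or LOAD into whatever table backs the application.
print(rows)
```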
The Modern Approach to Data Infrastructure

Modern applications are data-intensive because they make use of a breadth of data in more intricate ways than anything we have seen before. They combine data about you, about your environment, and about your usage, and use that to predict what you need to know. They can even take action on your behalf. This is made possible because of the data made available to the app, and data infrastructure that can process the data fast enough to make use of it. Analytics that used to be done in separate applications (like Excel or Tableau) are getting embedded into the application itself. This means less work for the user to discover the key insight, or no work at all when the insight is identified by the application and simply presented to the user. This makes it easier for the user to act on the data as they go about accomplishing their tasks.

To deliver this kind of application you might think you need an array of specialized data storage systems, ones that specialize in different kinds of data. But data infrastructure sprawl brings with it a host of problems. See this blog for a walkthrough of the problems that pattern causes. What's required is a database that:

Is consistent, highly available, durable, and resilient.
Scales with you as your usage grows without forcing a costly re-architecture at every stage.
Allows you to secure your data and meet all your privacy requirements.
Handles the loading of data fast enough to meet your SLAs.
Supports natively loading the data from all standard formats available.
Is capable of delivering all the analytics your application needs with no noticeable lag, no matter how busy the system is.
Doesn't force you to make copies of the data and move them around to various systems.

This is, admittedly, a tall order. But SingleStore, our fast unified database for data-intensive applications on any data, anywhere, is changing the way application developers are powering their modern applications. If you want to learn more about database systems that are capable of handling these requirements, check out our free trial. We'd love to power up your next data-intensive application.

The Future Of Business, Brought To Us By Data

I believe that we are in a fascinating era of technology, and every generation before and after us will make the same claim! This is the era of data. My first introduction to the term "data science" was in the early 2000s, when I heard that the data scientist had the sexiest job of the 21st century. Several trends around data converged during that period:

Big data was on the rise, and it marked more than a gradual change in volume: behavioral data, unstructured data, click-streams, images, audio, and video were unlike the information stored in systems of record.
Digital-native companies built businesses on these new data sources, creating FOMO in the traditional business world.
Computer science was shifting its focus from computation to data.

Data clearly had value beyond what most organizations traditionally derived from it. The digitally transformed organization was going to be a data-driven organization; data scientists were the rockstars making it happen.

Fast forward to today. A new form of artificial intelligence—task-specific, data-driven AI—has augmented human capabilities over the past decade. The Internet of People has evolved into the Internet of Things. Data are everywhere: any type, any volume, any velocity. Data has become a way of life, and that is a wonderful development in the digitally transforming world: data becomes a way of thinking, reasoning, understanding, and communicating. As someone remarked at the beginning of the COVID-19 pandemic: "The number of people who are looking at exponential curves has grown exponentially."

Data are beautiful, but data alone do not change the world. It is through the decisions based on data that we affect the world around us. The connective tissue between data and decision is formed by analytics, machine learning, and artificial intelligence.

For a long time, the technology paradigm was to train us to consume more technology. We live in a world of technology-literate people who know how to build and deploy technology for others, and who know how to use it. Like few other domains, data processing is amenable to automation and to training. If we can build systems by training them on (learning from) data, then maybe we can train technology to automate work with data. Is it time to turn the relationship around and build people-literate technology? Technology that understands what we need and delivers it for us. The burden of translating intent moves from the user to the computer. Technology works for us and augments us.

In this vision of the future, the traditional role of the CIO fades. The true information officer in the organization is the Chief Data & Analytics Officer. The data scientist will not be automated away, but part of her job today will be hidden in plain sight. If we can recommend which refrigerator to buy, we can recommend and automate table joins, data quality, and data integration. The data scientists of today are writing the future of what data-driven technology looks like. Their job, as I see it, is to empower society and to design what they create so that those who are not data scientists can explore, interact with, and learn from data. In return, the contribution and responsibility of data scientists will be elevated to something absolutely essential: running the business with data. Now that might actually be the sexiest job of the 21st century.
To learn more about trends in data and analytics that are expected to hit the mainstream over the next three years, SingleStore has made available to you the Gartner report, Top Trends in Data and Analytics for 2021. Download the report now.

The Modern Database Experience

For Developers, By Developers

Sometimes, if you're really lucky, you get unsolicited raves from developers who just love what your technology does for them. It's amazing to be able to showcase opinions that developers write to share with other developers. We recently had this experience with Jack Ellis from Fathom, and this week we received this contribution from Ero Carrera, Software Developer and Ex-Googler. We're grateful to Ero for this guest blog post, where he shares his experience working with modern databases, how he discovered SingleStore, and how it's been such a good fit for his engineering work.

In Ero's words:

First, some background: I have been working with SQL on and off for over 20 years. Over that time, I spent uncountable hours with Oracle, MySQL, PostgreSQL, and SQLite, optimizing queries and backends and designing schemas. While I learned a lot fiddling with different databases, a lot of effort went into setting them up, keeping them up to date, backing them up, and tuning their settings to get the most out of each solution.

For the last ten years I worked at Google, where I was lucky to experience one of the most incredible technology stacks in the industry. As an engineer, the internal database technologies let you nearly forget about the backend and just use it, at ridiculously large scales, without having to care about storage or even optimizing queries much. Being able to just focus on building systems is something really easy to get addicted to.

Some of the projects I worked on involved intelligence analysis for cybersecurity, where it was necessary to find and analyze relationships between different datasets of indicators of malicious activity. Each of the datasets was accessible via an interface using "Google-flavored" SQL, known outside Google as BigQuery. To read the datasets, one could simply write queries to produce and massage the data into the desired format and then process it, with no need to worry about the SQL engine, networking, or storage (for the most part). Each of those datasets was in the tens to hundreds of terabytes (many billions of rows). Working at that scale while being able to forget about the stack of technologies (while of course staying conscious of performance and following best practices) was simply incredible.

Database Discovery

Upon leaving Google last year, I was worried about the state of affairs in the "real world," and whether I might have gotten too used to those fancy toys. I started taking a look at what's available. I was hoping to find something of industrial strength that made my life as easy as possible, but still had all the bells and whistles of modern databases. My needs were leaning towards the relational side: while I love key-value databases, I needed to be able to run analytics on structured data, so I wanted something "SQL native".

My datasets were nowhere close to those I had at Google, but I still wanted speed and the fast query times that enable "interactive research". For my current projects I have time-series data, where I need to join multiple tables, with complex nested queries, to extract aggregate statistics. The tables are in the tens of millions of rows, which are joined and grouped in several rounds. In the pipeline I've built so far, I still haven't found a case where SingleStore doesn't return results in near-real time.

A good friend who was starting his second startup recommended ClickHouse for time-series data.
I tried to play with it, but bumped into some friction setting up a test environment and then having to interface with its own libraries. I needed something easy to maintain, with a very comfortable interface, and eventually found a great analysis, Selecting a Database for an Algorithmic Trading System, that convinced me to try SingleStore (still going by MemSQL in the article).

I was drawn in by how easy SingleStore is to interface with: just use the standard MySQL libraries and interfaces! A Docker image was readily available, it could easily bulk-ingest CSV data, and it had all the bells and whistles of distributed databases. The in-memory rowstore and on-disk columnstore reminded me of the similar optimizations in Google's internal tools, which made them so incredibly fast. Some of the technologies used at Google are discussed in the white paper An Inside Look at Google BigQuery and in some Stack Overflow posts, where a brief explanation of ColumnIO, Google's columnar format, is given. Additionally, the possibility of spinning up a managed instance in any of the main cloud providers seemed interesting, but was not on my radar originally (more on that in a bit).

New Horizons in the Database Experience

My initial experience with SingleStore's Docker image was very smooth. I was able to get it up and running within minutes. Documentation was up to date and I had no trouble setting it all up. My Python code could just use MySQL Connector/Python to connect to the instance, and it all worked. The workflow calls for scheduled ingestions of a few thousand to a few million records from CSV dumps. LOAD DATA worked like a charm to read them, taking only a few seconds for the largest dataset. My previous pipelines exported data into HDF5 that was later analyzed in Mathematica; I found that setup terribly slow and cumbersome compared with SingleStore plus a BI tool like Metabase.

More recently I've been tempted to try SingleStoreDB Cloud, so I don't even have to bother with the occasional updates (which were pretty easy anyway) or with starting and stopping the Docker image. It could not have been easier: again I found the documentation clear, and I was able to launch my own cluster with ease. I only needed to update the config files to point my pipelines to the new instance (I chose to host it in AWS) and it just worked. Additionally, given that I had to make zero code changes, I can easily go back and forth between the Docker instance and the managed cluster, just by updating a configuration file.

There was one wish I had related to the managed service: being able to suspend it when I'm not using it. I am currently working on a hobby project and do not need other users accessing the database, so keeping the instance spinning is a bit wasteful. I reached out to the very responsive SingleStore team, and they were happy to let me know that there's an upcoming on/off feature. So I'll be able to rely on the managed instance exactly as much as I need to. How great is that?

I have also been playing with UDFs to compute some metrics for my analytics and, again, it works. I could simply write function definitions for scalar functions in SQL and call them from my queries, leading to much simpler SQL.
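As a rough illustration of this workflow (the host, credentials, table, and UDF names below are hypothetical stand-ins, not the actual setup described above), a minimal Python sketch of the ingest-and-query loop might look like this:

```python
# Minimal sketch (assumed names): bulk-load a CSV dump into SingleStore and
# query it through the standard MySQL Connector/Python driver.
import mysql.connector

conn = mysql.connector.connect(
    host="127.0.0.1",          # or a SingleStoreDB Cloud endpoint
    port=3306,
    user="root",
    password="secret",
    database="research",
    allow_local_infile=True,   # needed for LOAD DATA LOCAL INFILE
)
cur = conn.cursor()

# Scheduled ingestion of a CSV dump (path and table are illustrative).
cur.execute("""
    LOAD DATA LOCAL INFILE '/dumps/prices.csv'
    INTO TABLE prices
    FIELDS TERMINATED BY ','
    IGNORE 1 LINES
""")
conn.commit()

# Call a hypothetical scalar SQL UDF, log_return(), directly from a query.
cur.execute("SELECT symbol, log_return(open_price, close_price) FROM prices LIMIT 10")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```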
Conclusion

I have yet to find any place where SingleStore doesn't meet my needs. Granted, I have not yet pushed it very hard, but it has made me much more productive, allowing me to focus on the logic of my pipelines (Prefect, ingesting and preparing data, and training and deploying some ML models) and on building analytics dashboards, for which I am using Metabase (which also works like a charm with SingleStore). While there's extensive documentation on how to tune settings for more demanding, high-performance conditions, I have not yet managed to write any queries that take more than a few seconds at most. Multiple WITH clauses and nested queries are handled gracefully.

I am very happy to have come across SingleStore. It has made me much more productive and provided me with a "database experience" like the one I had at Google. I look forward to my projects growing so I can see how far it can be pushed!

Ero Carrera
Software Engineer, Ex-Googler

Experience SingleStore for yourself! Install SingleStoreDB Self-Managed for FREE or deploy SingleStoreDB Cloud with $500 in FREE credits.

HumAIn Podcast: How to Power Enterprises with Intelligent Applications with Jordan Tigani of SingleStore

Jordan Tigani is the Chief Product Officer at SingleStore. He was a co-founding engineer on Google BigQuery, where he led engineering and later product teams.

Overview of the HumAIn Podcast
Season 5 | Posted on May 5, 2021 | Posted in: #data science, #fast analytics, #intelligent applications, #jordan tigani, #singlestore

SingleStore powers Comcast's streaming analytics, driving proactive care and real-time recommendations across 300K events per second. Since switching to SingleStore, Nucleus Security converted its first beta account to a paying customer, increased the number of scans Nucleus can process in one hour by 60X, and saw a 20X speed improvement for its slowest queries.

To be more competitive in our new normal, organizations must make real-time, data-driven decisions. And to create a better customer experience and better business outcomes, data needs to tell customers and users what is happening right now.

With the pandemic accelerating digitization, and new database companies going public (Snowflake) and filing IPOs (Couchbase), the database industry will continue to grow exponentially, with new advanced computing technologies emerging over the next decade. Companies will begin looking for infrastructure that can deliver real-time analytics; they can no longer afford to use technology that cannot handle the onslaught of data brought by the pandemic.

True Digital in Thailand uses SingleStore's in-the-moment analytics to develop heat maps of geographies with high COVID-19 infection rates, showing where people are congregating, pointing out areas to be avoided, and ultimately helping flatten the curve. In two weeks' time, SingleStore built a solution that could perform event stream processing on 500K anonymized location events every second for 30M+ mobile phones.

Businesses need to prioritize in-app analytics: this allows you to influence customers' behavior within your application or outside of it, based on data. Additionally, businesses must utilize a unified database that supports transactions and analytics to deliver greater value to customers and the business. Enterprises must adopt technology that can handle different types of workloads and datasets, modernize infrastructure, and use real-time analytics.

Please click here for the full podcast.

Shownotes links:
https://www.linkedin.com/in/jordantigani
https://twitter.com/jrdntgn
www.SingleStore.com
https://www.linkedin.com/company/singlestore/
https://www.singlestore.com/media-hub/releases/research-highlights-spike-in-data-demands-amid-pandemic/
https://www.singlestore.com/media-hub/releases/businesses-reconsidering-existing-data-platforms/

Why You Should Use a Scale-out SQL DBMS Even When You Don't Need One

Everybody has heard of the KISS principle -- Keep It Simple, Stupid. When applied to data management technology, it implies that you should use a simple solution when it works. Developers often believe that means using a single-node DBMS like MySQL, PostgreSQL, or SQL Server.

So, what is a "single-node" database? Essentially, it's a database designed for a single machine. The capacity of that machine dictates the resource constraints on processing power, connections, and storage. If you need more processing power or storage, you can vertically scale, meaning you upgrade to a more powerful machine. This can work, up to the scale limits of the largest available machine. By contrast, scale-out databases are built as distributed databases from the start; that is, they are designed to run across machines.

Sometimes people see scale-out relational database technology, like SingleStore, as strictly for high-end applications beyond what a single-node DBMS can handle. The thinking goes that scale-out databases are inherently more complex to use than a single-node DBMS. I'm here to tell you that the KISS principle is right. But people who think it means you should only use scale-out for "extreme" apps are dead wrong. Let's look at the arguments for why scale-out is not "simple," and address them one by one. We'll see that in many cases, scale-out actually makes things simpler.

It's hard to get, set up, and manage the hardware and software. First, database platform-as-a-service (PaaS) offerings like SingleStoreDB Cloud handle all that for you. You can choose a size and dial up and down whenever you need to. For self-hosting, using the public cloud or a well-organized enterprise data center with a range of hardware SKUs means you can pick out and provision machines fairly easily.

My single-node database is fast enough. For some problems with small data, a single-node DBMS may be fast enough. But there are thousands of applications out there with medium-to-large data on a single-node system. The people who built them may think they are fast enough, but what if they could run queries instantaneously? Research shows that response time under ¼ second feels instantaneous, and that generates incredible user satisfaction. This drives users to explore more freely, learning more about the data, which helps them make better decisions. Your single-node DBMS might be able to provide near-instant responses for a few users, but what if there are many users? Enterprises are pursuing digital transformation initiatives to get more out of the data they have, not just to scale to handle bigger data sets. The levels of speed and concurrency you can get from scale-out enable new applications that unlock data's value, enabling digital transformation.

If my problem gets big, I can just add more nodes of my regular RDBMS. A common pattern for scaling applications is to create multiple databases on multiple nodes using a single-node DBMS. This can be done in a number of ways, such as (a) putting each customer's data in a different database with the same schema and spreading those databases across multiple nodes, or (b) creating replicas of the same database to scale out the workload. Either way, this is anything but simple, because you have to decide at the application level how to split up your data. Moreover, there might be a situation where you need more than one node to handle the workload for a single database. At that point, your single-node database runs out of steam.

My problem is not big enough to benefit from scale-out.
You'd be surprised how small an application can be and still benefit from scale-out. Here are a couple of examples. First, SingleStore has a customer with a few billion rows of data in a data mart that would easily fit on a single-node database. But they are delivering quarter-second response times for dashboard refreshes to several hundred concurrent users, with intra-day updates, enabling a complete digital transformation in how the data is consumed. It's essentially a real-time data mart. As a second example, we have customers with less than a million rows of data who are using brute-force searches of vector data for AI image-matching applications, using the DOT_PRODUCT and EUCLIDEAN_DISTANCE functions in SQL. This brute-force approach gives them better match fidelity than multi-dimensional vector indexing (which is not available in SQL DBMSs to date, in any case), and still lets them integrate their match queries with other SQL query constructs. And see the later discussion about using brute-force scale-out to simplify away the need for complex solutions like pre-calculated aggregate tables. Plenty of people with only a few hundred thousand rows of data can benefit from that.

It's hard to design my scale-out database to get it to perform. Yes, there are a couple of new concepts to learn with a scale-out database, mainly sharding (for partitioning data across nodes) and reference tables (for duplicating small dimension tables onto each node). But the horsepower you get from a good, high-performance scale-out database actually simplifies things. For example, you may need materialized views or pre-calculated summary aggregate tables with a single-node database, but not with a scale-out database. Pre-calculated aggregates and materialized views are tricky to use and introduce design problems that are conceptually harder than deciding how to shard your data. If you know when to use an index, you can learn how to shard your data and use reference tables in a few minutes. And you don't have to get it right the first time; it's easy to reconfigure: create a new table with the configuration you want, INSERT...SELECT... the data into it, drop the old one, rename the new one, and you're done.
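A rough sketch of that reconfiguration follows, assuming a hypothetical orders table being re-sharded on customer_id; the table and column names are illustrative, and the exact SHARD KEY and RENAME syntax should be checked against your SingleStore version:

```python
# Hypothetical sketch of re-sharding a table: build a new table with the shard
# key you want, copy the data with INSERT...SELECT, then swap the tables.
import mysql.connector

conn = mysql.connector.connect(host="127.0.0.1", port=3306,
                                user="root", password="secret", database="app")
cur = conn.cursor()

statements = [
    # New table sharded on customer_id instead of the old key.
    """CREATE TABLE orders_new (
           order_id BIGINT,
           customer_id BIGINT,
           total DECIMAL(12, 2),
           SHARD KEY (customer_id)
       )""",
    "INSERT INTO orders_new SELECT order_id, customer_id, total FROM orders",
    # Brief window where the table is absent; do this in a maintenance window.
    "DROP TABLE orders",
    "ALTER TABLE orders_new RENAME TO orders",
]
for stmt in statements:
    cur.execute(stmt)
conn.commit()
```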
I can't find people with the skills to use a scale-out DBMS. Modern SQL-based scale-out databases are based on standard SQL, and are often compatible with MySQL or Postgres; SingleStore is largely compatible with MySQL, for example. So hundreds of languages and tools can connect to these SQL-based scale-out databases, and almost all of the SQL skills people have from working with databases like MySQL and Postgres are transferable.

I want to use a general-purpose database, and there are no general-purpose databases with scale-out that work for me. A general-purpose tool does simplify life, by making skill sets applicable to multiple problems and reducing the effort to search for the right tool. Fortunately, there is a general-purpose SQL database that scales out and runs anywhere. I think you know what database that is.

I could go on, but you get the idea -- the spartan simplicity of relational databases and SQL carries its benefits over to scale-out SQL systems. Scale-out simplifies lots of things, and it enables digital transformation opportunities that can be valuable to your business and your career. There's a corollary to the KISS principle that applies here also, often attributed to Albert Einstein: "Everything should be made as simple as possible, but no simpler." In this context, that means you shouldn't give up on really valuable application innovations made possible by the performance you can get from scale-out because of perceived complexity.

Finally, scale-out is about improving speed as well as scalability. SingleStore scales out, but it also has other important technologies to make things faster, including in-memory rowstore tables, columnstores, compilation of queries to machine code, vectorization, and a high-performance plan cache. All of these squeeze the most out of each processor core in your system.

So, next time you are about to reach for your trusty single-node DBMS, ask yourself: can I reach higher, and do more, and keep it simple at the same time?

How to Accelerate Analytics for SaaS Developers

Everyone remembers their first experience in a Tesla. Mine was in Las Vegas with a good friend and CIO. I was in town for a tech conference, and that provided a good opportunity to reconnect and discuss a new project. He offered to pick me up at my hotel. That was the moment I was first introduced to Tesla's unique Ludicrous mode. It was exhilarating. The Strip became one long, breathless blur. If you haven't experienced zero to 60 mph in 2.2 seconds, take a look at the "Tesla reactions" genre of videos online.

Breakthroughs in Scaling Data Analytics

Wouldn't it be great to get that kind of reaction to your product experience? Whether your customers are launching queries to generate dashboards, building leaderboards for gaming analytics, filtering audience views, or generating BI reports, a constant challenge is scaling your data infrastructure without slowing your services or showing your users the dreaded spinning wait animation. It may be time to hit the accelerator on the analytics in your SaaS product and give your users the thrill of your own version of Ludicrous mode. SingleStore is the data engine that powers Ludicrous mode for real-time, interactive data analytics in your SaaS applications.

As an application developer, you generally choose the database that you're familiar with, that is general purpose, and that has broad support. If you're building in a cloud environment, this often leads you to a hosted relational database, like Azure SQL, AWS RDS for MySQL, or Google Cloud SQL. These work fine early on, but start to show cracks as your SaaS product gains rapid adoption. This is the moment you start to encounter data bottlenecks, which show up in your customer experience. Solving these data bottlenecks is essential to get right, and it is a frequent obstacle. How can you ensure backend scalability while simultaneously focusing on delivering a simple, easy-to-use service?

When application developer Gerry Morgan started encountering data bottlenecks in DailyVest's portfolio analytics API for their 401(k) customers, he and fellow engineer Kevin Lindroos identified the culprit as their Azure SQL database. While it had served them well initially, as data grew their costs grew, but performance was never more than adequate. As they extrapolated their customer growth plans, they determined that they needed a better way to control costs as their customer base grew. So they began the search for a new database platform that could support their growth and the ad hoc analytical queries over large data volumes that their portfolio analytics product required. This led them to consider columnstore databases. For application developers unfamiliar with columnstores: they are generally best for analytical workloads, whereas rowstores are generally best at transactional workloads (should you use a rowstore or columnstore?); a short sketch contrasting the two appears below. After investigating columnstore databases such as AWS Redshift, Vertica, MariaDB, and kdb+, they discovered SingleStore met, or rather exceeded, all of their requirements. The benefits were clear: it had a better total cost of ownership, provided a managed service in Azure, executed stored procedures multiple times faster than Azure SQL, and accelerated database backups from 1 hour to just 10 minutes. To learn more about how these application developers scaled and accelerated the analytics in their SaaS applications, watch How DailyVest Drove a 90% Performance Improvement.
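As promised above, here is a small, hypothetical sketch of the rowstore/columnstore distinction in SingleStore terms. The table names are assumptions and the exact DDL varies by SingleStore version; the point is simply that one table shape favors transactional point reads and writes while the other favors analytical scans:

```python
# Hypothetical sketch: the same orders data as an in-memory rowstore table
# (fast point lookups and updates) and a disk-based columnstore table
# (fast scans and aggregations). DDL details differ across versions.
import mysql.connector

conn = mysql.connector.connect(host="127.0.0.1", user="root",
                               password="secret", database="demo")
cur = conn.cursor()

# Rowstore: rows kept in memory, indexed for fast single-row reads and writes.
cur.execute("""
    CREATE ROWSTORE TABLE orders_txn (
        order_id BIGINT PRIMARY KEY,
        account_id BIGINT,
        amount DECIMAL(12, 2)
    )
""")

# Columnstore: columns stored on disk, ideal for analytical scans over history.
cur.execute("""
    CREATE TABLE orders_history (
        order_id BIGINT,
        account_id BIGINT,
        amount DECIMAL(12, 2),
        order_date DATE,
        SORT KEY (order_date)
    )
""")
```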
For IEX Cloud, the data bottleneck they encountered when scaling their cloud service was a little different. IEX Cloud is a division of IEX Group, the company made famous by Michael Lewis' 2014 book "Flash Boys: A Wall Street Revolt". The key service IEX Cloud delivers is a real-time financial market data API. It requires collecting and aggregating many historical and real-time data sources, which are then processed and served to customers. Balancing the flow of data from disparate sources to over 130,000 consumers, while serving over 1.5 billion API responses per day and 800,000 data operations per second, demands a lot of simultaneous read and write volume from their database backend. Tracking real-time changes in stock prices is a write-intensive operation, while serving billions of API requests against that fast-changing data is read-intensive. Serving real-time streaming analytics, with metrics like P/E ratios and market capitalization computed over both streaming and historical data, adds compute-intensive workloads to the mix. Furthermore, as a data aggregator, IEX Cloud must refresh reference data from hundreds of providers throughout the day through ETL, and they expect the number of ETL processes will soon be in the thousands. Compounding the situation, daily market volatility correlates with volatility in the volume of API traffic from their customers.

IEX Cloud needed improved performance in multiple areas that their initial Google Cloud SQL for MySQL service wasn't delivering. Their requirements included high-performance bulk data loading through ETL, streaming data ingestion, storing all the data, performing real-time analytics, low-latency responses to many parallel API requests, and the ability to easily scale horizontally. After trying a variety of database types, including CockroachDB, YugabyteDB, ClickHouse, and Google BigQuery, IEX Cloud found that only SingleStore could satisfy all of their requirements, and do it in just one system that was cost-effective, had an established community, and offered good support. Learn more about this SaaS analytics scaling challenge in The FinTech Disruption of IEX Cloud webinar.

Common First Steps

When performance and scaling issues arise, application developers are among the first to know in today's DevOps world. Application performance monitoring alerts the team, and triage gets into motion. At this point, if a DBA is available, the investigation begins: queries are profiled and indexes are modified or added.

If handling read volume is the issue, a common technique is to provide a read replica by replicating from the primary database. This offloads work, but at the cost of adding latency and duplicating data. If the data is fast-changing, the approach is less effective, because the data in the replica is out of date all too often.

Caching is the next option for scaling read-heavy workloads. You've seen it work great for static assets in Gatsby, Next.js, or React Static, but managing your dynamic business data this way is another animal. Managing cache eviction is complicated and expensive for fast-changing data. Another challenge is that the size of your cached data must fit into the memory of a single machine. This works well for small datasets, but you'll soon be looking for a way to scale the cache if your data is large. Scaling out a cache by adding more nodes, for Redis for instance, provides availability of the data but at the cost of data consistency. It also adds infrastructure and more complexity.
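To make that caching trade-off concrete, here is a minimal, hypothetical cache-aside sketch with Redis in front of a relational database; the names and the 60-second TTL are illustrative assumptions. Every write path has to remember to invalidate the cached entry, which is exactly what becomes hard to get right for fast-changing data:

```python
# Minimal cache-aside sketch (hypothetical names). Reads check Redis first and
# fall back to the database; writes must invalidate the cache, or readers keep
# seeing stale data until the TTL expires.
import json
import redis
import mysql.connector

cache = redis.Redis(host="localhost", port=6379)
db = mysql.connector.connect(host="127.0.0.1", user="app", password="secret",
                             database="saas")

def get_portfolio_stats(account_id: int) -> dict:
    key = f"portfolio:{account_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit (possibly stale)
    cur = db.cursor(dictionary=True)
    cur.execute("SELECT * FROM portfolio_stats WHERE account_id = %s",
                (account_id,))
    row = cur.fetchone() or {}
    cache.setex(key, 60, json.dumps(row, default=str))  # expire after 60s
    return row

def invalidate_portfolio_stats(account_id: int) -> None:
    # Must be called on every write path, or readers see out-of-date data.
    cache.delete(f"portfolio:{account_id}")
```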
Another option for scaling is to use database partitioning. This technique cuts the data into sizes that will fit on a single server, no matter how much data you have. There are various types of partitioning/sharding to ensure no downtime or data loss in the event of a node failing, and various approaches for partitioning and indexing the data based on your queries. You can try to do this yourself, but it may get you more than you bargained for as an application developer. There is an easier way that provides the scalability, speed, and simplicity you need.

Solving for Scale, Simply

SingleStore solves three key challenges for SaaS applications that embed live or real-time analytics, without the need to change your application design, add a cache, partition manually, or add other external middleware:

Scaling data ingest
Scaling low-latency queries
Scaling user concurrency

SingleStore is a distributed, highly scalable SQL database. This is an important characteristic, as it is the foundation for how it addresses each of these SaaS analytics scaling issues. For scaling data ingestion, SingleStore breaks the bottleneck by providing parallel ingestion from distributed streaming sources in addition to bulk data loads; as mentioned earlier, both are important for IEX Cloud's analytics API. Next, query processing can scan trillions of rows per second on tables with over 50 billion records. This is the kind of low-latency query speed that makes fans out of your SaaS application users. Finally, SingleStore scales to meet the needs of your growing customer base through support for high concurrency.

For Nucleus Security, every millisecond matters when it comes to thwarting cyberattacks. Their original database choice was MariaDB, but it failed to keep up with their need to perform more frequent vulnerability scans and serve real-time analytics to a quickly growing number of government users. SingleStore delivered the high concurrency needed for their rapidly growing SaaS applications while dramatically improving performance by 50x, at ⅓ the cost of the alternatives. Scott Kuffer, co-founder of Nucleus Security, describes the details in the Every Millisecond Counts in Cybersecurity webinar.

Every successful SaaS application includes analytics, either as a core offering or as an adjunct feature, and analytics places increased demand on the database. Customers won't wait for your spinning animation while their data loads, so it is imperative to deliver the app and the analytics fast. Otherwise, you see more incidents raised, negative reviews posted on social feeds and review sites, and you may find that your customer churn increases. In short, your business goes down and you have to fight to get your growth back. This can happen in the blink of an eye when switching costs are relatively low for SaaS services. That's no way to claim your slice of the $157 billion global SaaS application market.

SingleStore accelerates and scales the analytics in your SaaS application by delivering scalable data ingestion, single-millisecond low-latency queries, and high concurrency. But beyond the ludicrous speeds and thrills it delivers, our customers rave about our customer support and optimization services, our established and robust community, and how cost-effective the solution is. To learn more about using SingleStoreDB Cloud to scale the analytics in your SaaS application, join us for the upcoming webinar on April 8.

AWS SageMaker and SingleStore: Operationalizing Machine Learning at Scale

Abstract

Many organizations today are looking for ways to run machine learning (ML) models in operational systems at scale. Data scientists can now take models that they build in SageMaker and deploy them as user-defined functions (UDFs) within our general-purpose relational database. Read on to learn how you can accelerate the execution of your models against real-time data. You can also find in-depth notebooks detailing how to implement this on GitHub.

Setting the Stage

AWS SageMaker has quickly become one of the most widely used data science platforms in the market today. Evidence suggests that even though there has been a proliferation of models in the enterprise, only 17% of companies have actually deployed a machine learning model into production. The following tutorial shares how we at SingleStore are helping our customers do just that.

Today, many enterprises are seeking to complete the machine learning lifecycle by deploying models in-database. There are several reasons why AI engineers are taking this approach, as opposed to operationalizing their models in SageMaker itself:

First, training and deploying models where the data lives means your models are always based on the entirety of your data, even the freshest records.
Keeping models in the same place as the data also reduces the latency and cost (for example, EC2) that separating the two can introduce.
In the case of real-time data, users can even run their models against every new live data point that is streamed into the database, with amazing performance, while continuing to reference historical data.

Here at SingleStore, we are enabling AI developers to deploy models within our converged data platform. Our ability to ingest data in real time from anywhere, perform sub-second petabyte-scale analytics with our Universal Storage engine, and support high concurrency makes us the perfect home for your models. The next section details how SageMaker and SingleStoreDB Cloud, our cloud database-as-a-service offering, work together to help accelerate your machine learning models.

Reference Architecture

There are many different ways that machine learning engineers can use SageMaker and SingleStoreDB Cloud together. The reference architecture below describes just one approach; these components can be interchanged based on your preferred architecture. It is an example of how SingleStore can easily leverage your existing data and modeling ecosystem, with S3 and SageMaker working in concert, to make models run faster. In this architecture, models are built in SageMaker using data from S3. The models are then converted into UDFs using SingleStore's "SageMaker to Python" library. At that point, the models live in the database and can be executed against real-time data streams coming from Kafka using native SQL. SingleStore also supports many other real-time and batch ingest methods, listed here.
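As a rough sketch of what that last step can look like in practice, the snippet below wires up streaming ingestion plus in-database scoring. The pipeline, table, column, and UDF names are hypothetical, and fraud_score() merely stands in for a model-derived UDF rather than showing the actual "SageMaker to Python" conversion:

```python
# Hypothetical sketch: stream new records from Kafka into a table with a
# SingleStore pipeline, then score them in SQL with a scalar UDF that stands
# in for a model deployed from SageMaker. All names and endpoints are assumed.
import mysql.connector

conn = mysql.connector.connect(host="127.0.0.1", user="root",
                               password="secret", database="ml_app")
cur = conn.cursor()

# Continuously ingest transaction events from a Kafka topic into a table.
cur.execute("""
    CREATE PIPELINE transactions_pipeline AS
    LOAD DATA KAFKA 'kafka-broker:9092/transactions'
    INTO TABLE transactions
    FIELDS TERMINATED BY ','
""")
cur.execute("START PIPELINE transactions_pipeline")

# Score the freshest rows with a hypothetical in-database fraud_score() UDF.
cur.execute("""
    SELECT transaction_id, fraud_score(amount, merchant_id, device_id) AS score
    FROM transactions
    WHERE event_time > NOW() - INTERVAL 1 MINUTE
""")
for transaction_id, score in cur.fetchall():
    print(transaction_id, score)
```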

If Your Business Is Not Working In Real Time, You’re Out Of Time To Capture The Business Moment

Business is about serving the needs of customers. But customer expectations are changing quickly, and most organizations are not truly aware of how fast that's happening. Most businesses are moving in slow motion relative to their customers. That means they miss out on opportunities to make decisions about, and act on, the moments that matter.

In the past, lag time was accepted. Nielsen called people on the phone to understand their TV viewing habits. Broadcast TV networks set advertising rates, and advertisers gauged viewership, based on Nielsen ratings. It took a long time for a legion of people to collect this data, and once they got it, it was typically a small and outdated sample. But this was the best available method given the technology of the time. Today these types of approaches simply don't work, and they don't have to. Organizations can use modern technology to move quickly and benefit from in-the-moment opportunities. That enables them to act in real time to deliver better experiences to retain and add customers, and to optimize solutions for their clients and business partners.

What Is Real Time?

The definition of "real time" depends upon the context. In the context of a video streaming service, "now" means instantaneously. If you're serving up pixelated videos or you can't deliver an advertisement, you can lose consumer users or advertising sponsors. Latency is also a conversion killer for websites and a costly problem for financial traders. Akamai, one of my company's clients, reports that conversion rates drop 7% for every 100 milliseconds of added latency.

Real time can also mean seconds or minutes. Thorn, another of my company's clients, which works to prevent child sex trafficking, processes massive amounts of web data quickly. This improves child identification and investigation time by up to 63%; each passing minute matters and determines the likelihood of saving a child. Speed is also key in fighting the pandemic. True Digital, also one of my company's clients, is using real-time data to monitor human movement using anonymized cellular location information. This can help authorities prevent large gatherings that can become coronavirus hot spots.

In each of these scenarios, what is considered "real time" depends upon the context and the goal. But all of these scenarios define crucial moments in which having the relevant current and historical data immediately available for processing is essential.

In-The-Moment Decision-Making Requires Infrastructure Simplicity

You have to simplify to accelerate business in this way. To go faster and get finer-grained, real-time metrics, you can't have 15 different steps in the process and 15 different types of databases and storage. That adds up to too much latency and exorbitant maintenance fees. Instead, you need to be able to do the same things, and add new business functions, with less infrastructure. This requires technology convergence. As Andrew Pavlo of Carnegie Mellon University and Matthew Aslett of 451 Research wrote, NewSQL database management systems now converge capabilities that in the past were implemented one at a time in separate systems. This is a byproduct "of a new era where distributed computing resources are plentiful and affordable, but at the same time the demands of applications [are] much greater." Now you can go faster. You can make decisions and act on them in real time. You're in the game rather than sitting on the sidelines waiting for information while competitors are acting.
Modern Businesses And Their Customers Benefit From Real-Time Data Today

FedEx founder and CEO Fred Smith said in 1979 that "the information about the package is as important as the package itself." This highlights the power of data. Companies like FedEx now use this power to dynamically reroute packages based on customer interactions and to optimize their routes. Real-time data allows customers to use digital interfaces to see when and where their packages will be delivered, request that a package be sent to an alternate location, and have that request honored. It's not just FedEx that's doing this; other companies like DHL and UPS have done dynamic rerouting for years. This is important because people are a lot more mobile these days; customers expect businesses to be more responsive to their needs, and tend to give businesses that cater to them higher customer satisfaction and Net Promoter Scores. On-time delivery also helps logistics companies avoid missing service level agreements and then paying penalties.

You can't do route optimization and dynamic rerouting if your information about the package and other relevant details is hours behind where the package actually is. You need your digital environment to mirror what's happening in the real world. When you create a digital mirror of your environment, you get what is called a digital twin. As our co-founder recently explained, digital twins are often associated with industrial and heavy machinery, but organizations in many sectors are now exploring and implementing digital twins to get a 360-degree view of how their businesses operate. This requires organizations to have converged, cloud-native, massively scalable, and fast-performing infrastructure that supports artificial intelligence and machine learning models. Organizations that don't have these capabilities will be outmaneuvered by faster companies that do have the intelligence and agility to make decisions and act in the now.

Embracing Intelligence And Agility

Understand that delivering faster data isn't the objective. The objective is to deliver the optimal customer experience and improved operational insights. Let these two objectives be your guide, and seek ways to leverage all relevant data in the moments that matter. Dreaming big is important, but to start, identify a small project that combines current, live, real-time data with historical data for in-the-moment views, automated decision-making, and trendspotting, addressing a customer experience or operational opportunity or challenge. Polyglot persistence provides real development advantages, but it's not necessary to assemble multiple types of data stores to get those advantages. Choose simplicity with flexibility by looking for solutions that support a spectrum of workloads, reducing cloud data infrastructure complexity.

This was previously posted on Forbes.