The Power of SQL for Vector Database Operations, Part 1
Engineering

Here’s how to use one SQL query to get the top K vector matches in each category.

At SingleStore, our position on vector database processing is that you should have all the benefits of a modern, full-featured DBMS available to you when working with vectors [Han23]. This includes full SQL support.

We're working with a prospective user of SingleStoreDB for vector database operations related to an application with a semantic search component. He asked how we could do the following easily: find the top K items in each category. This is something he didn't have an easy time with in Milvus, a specialty vector database.

With Milvus, he was finding the categories in one query, looping through them and finding the top K elements for one category at a time with a separate query. This is not easily parallelizable, and requires more work from the application side than many would prefer.

Here's how you can do this in a single SQL query in SingleStoreDB:

/* Make some items in multiple categories, with associated
   vector embeddings. */
create table items(id int, category varchar(50), vector blob);

insert into items values
  (1, "food", json_array_pack('[0,0,0,1]')),
  (2, "food", json_array_pack('[0,0.5,0.3,0.05]')),
  (3, "food", json_array_pack('[0,0.5,0.2,0]')),
  (4, "facilities", json_array_pack('[0,0,1,0]')),
  (5, "facilities", json_array_pack('[0,0.6,0.1,0.05]')),
  (6, "facilities", json_array_pack('[0,0.4,0.3,0]'));

-- query vector
set @qv = json_array_pack('[0,0.4,0.3,0]');

-- get top 2 in each category using ranking
with scored as (
  select id, category, dot_product(vector, @qv) as score
  from items
),
ranked as (
  select row_number() over(partition by category order by score desc)
    as rank, *
  from scored
)
select *
from ranked
where rank <= 2
order by category, rank;

These are the results:

+------+------+------------+---------------------+
| rank | id   | category   | score               |
+------+------+------------+---------------------+
|    1 |    4 | facilities | 0.30000001192092896 |
|    2 |    5 | facilities | 0.27000001072883606 |
|    1 |    2 | food       |  0.2900000214576721 |
|    2 |    3 | food       | 0.25999999046325684 |
+------+------+------------+---------------------+

It’s important to note that we're just focusing on ease of expression here, not performance. For this particular application, the scope is usually a few million vectors at most — so a full-scan, exact-nearest-neighbor approach is plenty fast.

When you choose a tool for vector processing for nearest-neighbor search applications like semantic search, chatbots, other LLM applications, face matching, object matching and more, we think it's a good idea to consider the power of the query language. Having full SQL available just makes things easier.

References

[Han23] E. Hanson and A. Comet, Why Your Vector Database Should Not be a Vector Database, SingleStoreDB blog, April 24, 2023.

Explore more vector database-related resources:
SingleStoreDB Vectors
eBook: Selecting the Optimal Database for Generative AI
Why Your Vector Database Should Not be a Vector Database
Product

The database market is seeing a proliferation of specialty vector databases. People who buy these products and plumb them into their data architectures may find initial excitement with what they can do with them to query for vector similarity. But eventually, they will regret bringing yet another component into their application environment.

Vectors and vector search are a data type and query processing approach, not a foundation for a new way of processing data. Using a specialty vector database (SVDB) will lead to the usual problems we see (and solve) again and again with our customers who use multiple specialty systems: redundant data, excessive data movement, lack of agreement on data values among distributed components, extra labor expense for specialized skills, extra licensing costs, limited query language power, programmability and extensibility, limited tool integration, and poor data integrity and availability compared with a true DBMS.

Instead of using an SVDB, we believe that application developers using vector similarity search will be better served by building their applications on a general, modern data platform that meets all their database requirements, not just one. SingleStoreDB is such a platform.

SingleStoreDB

SingleStoreDB is a high-performance, scalable, modern SQL DBMS and cloud service that supports multiple data models including structured data, semi-structured data based on JSON, time-series, full text, spatial, key-value and vector data. Our vector database subsystem, first made available in 2017 and subsequently enhanced, allows extremely fast nearest-neighbor search to find objects that are semantically similar, easily using SQL. Moreover, so-called "metadata filtering" (which is billed as a virtue by SVDB providers) is available in SingleStoreDB in far more powerful and general form than they provide — simply by using SQL filters, joins and all other SQL capabilities.

The beauty of SingleStoreDB for vector database management is that it excels at vector-based operations and it is truly a modern database management system. It has all the benefits one expects from a DBMS including ANSI SQL, ACID transactions, high availability, disaster recovery, point-in-time recovery, programmability, extensibility and more. Plus, it is fast and scalable, supporting both high-performance transaction processing and analytics in one distributed system.

SingleStoreDB Support for Vectors

SingleStoreDB supports vectors and vector similarity search using dot_product (for cosine similarity) and euclidean_distance functions. These functions are used by our customers for applications including face recognition, visual product photo search and text-based semantic search [Aur23]. With the explosion of generative AI technology, these capabilities form a firm foundation for text-based AI chatbots.

The SingleStore vector database engine implements vector similarity matching extremely efficiently using Intel SIMD instructions.
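To make the metadata-filtering point concrete, here is a minimal sketch (not from the original post) of how a vector similarity search can be combined with ordinary SQL filters. The products table, its columns and the query vector are hypothetical stand-ins.

-- hypothetical schema: a vector embedding alongside ordinary metadata columns
create table products(
  id int,
  category varchar(50),
  price decimal(10,2),
  embedding blob
);

-- pack a (hypothetical) query vector into binary form
set @query_vec = json_array_pack('[0.1, 0.2, 0.3, 0.4]');

-- "metadata filtering" is just a WHERE clause: restrict by category and price,
-- then rank the surviving rows by similarity (dot_product on normalized vectors)
select id, category, price,
       dot_product(embedding, @query_vec) as score
from products
where category = 'food'
  and price < 20.00
order by score desc
limit 5;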
AI-Powered Semantic Search in SingleStoreDB
Engineering

SingleStoreDB can supercharge your apps with AI. In this blog, we demonstrate how semantic search can be performed on your data in SingleStoreDB — including code examples and a motivating case study from Siemens, a SingleStore customer.

What Is Semantic Search?

At its core, semantic search relies on natural language processing (NLP) to accurately interpret the context and intent behind a user's search query. Unlike traditional keyword-based search methods, semantic search algorithms take into account the relationship between words and their meanings, enabling them to deliver more accurate and relevant results — even when search terms are vague or ambiguous. Semantic search relies heavily on machine learning algorithms to identify language patterns and understand concept relationships. Embeddings are a key tool in semantic search, creating vector representations of words that capture their semantic meaning. These embeddings essentially create a "meaning space," where words with similar meanings are represented by nearby vectors.

What Is SingleStoreDB?

SingleStoreDB is a real-time, distributed SQL database designed to handle both transactional (OLTP) and analytical (OLAP) workloads within a unified engine. With support for fast writes and efficient querying, SingleStoreDB excels at managing large-scale transactional workloads and delivering real-time analytics. SingleStoreDB is available as a cloud service (SingleStoreDB Cloud) or for self-hosted installation.

SingleStoreDB is also a multi-model database and provides vector database extensions, in addition to support for relational, semistructured, full-text, spatial and time-series data. Its vector capabilities include built-in functions for vector similarity calculations such as cosine similarity and Euclidean distance. These functions can be leveraged in SQL queries to perform similarity calculations efficiently on large volumes of vector data. Moreover, filters on metadata (other descriptive data about objects for which you've created vector embeddings) can be easily intermixed with vector similarity search, simply by using standard SQL WHERE clause filters. An easy way to get started is to sign up for a SingleStoreDB Cloud trial — and get $500 in credits.

Is SingleStoreDB the Optimal Foundation for Semantic Search in Your Applications?

SingleStoreDB's patented Universal Storage supports both OLTP and OLAP workloads, making it ideal for semantic search use cases. Adding embeddings to your data is simple — just place the vector data in a binary or blob column, using the json_array_pack() or unhex() functions.

Efficient retrieval of high-dimensional vectors and handling of large-scale vector similarity matching workloads are made possible by SingleStoreDB's distributed architecture and efficient low-level execution. You can also rely on SingleStoreDB's built-in parallelization and Intel SIMD-based vector processing to take care of the heavy lifting involved in processing vector data. This enables you to achieve fast and efficient vector similarity matching without the need to parallelize your application or move lots of data from your database into your application. We previously benchmarked the performance of our vector matching functions in our blog, "Image Matching in SQL with SingleStoreDB," where we ran the dot_product function as a measure of cosine similarity on 16 million records in just 5 milliseconds.

With its support for SQL, SingleStoreDB provides developers with a familiar and powerful interface for building semantic search applications.
SQL can be used to create complex queries and perform advanced analytics on the text data stored in SingleStoreDB. In fact, with just one line of SQL, developers can run a semantic search algorithm on their vector embeddings, as demonstrated in the following example. SingleStoreDB's ability to update and query vector data in real time enables us to power applications that continuously learn and adapt to new inputs, providing users with increasingly precise and tailored responses over time. By eliminating the need for periodic retraining of machine-learning models or other time-consuming processes, SingleStoreDB allows for seamless and efficient provision of real-time insights.

See Semantic Search with SingleStoreDB in Action!

The following tutorial will guide you through an example of adding embeddings to each row in your SingleStoreDB database using OpenAI APIs, enabling you to run semantic search queries in mere milliseconds using Python. Follow along to add embeddings to your dataset in your desired Python development environment.

Our goal in this example is to extract meaningful insights from a hypothetical company's employee review dataset by leveraging the power of semantic search. By using OpenAI's Embeddings API and vector matching algorithms on SingleStoreDB, we can conduct sophisticated queries on the reviews left by employees about their company. This approach allows us to delve deeper into the true sentiments of employees, without being constrained by exact keyword matches.

Step 1: Install and import dependencies in your environment

Install the following dependencies in your development environment using pip3.

pip3 install mysql.connector openai matplotlib plotly pandas scipy scikit-learn requests

Then start python3, and at the python3 command prompt, import the following dependencies.

import os
import openai
import json
from openai.embeddings_utils import get_embedding
import mysql.connector
import requests

Step 2: Create an OpenAI account and get API connection details

To vectorize and embed the employee reviews and query strings, we leverage OpenAI's Embeddings API. To use this API you will need an API key, which you can get here. You'll need to add a payment method to actually get vector embeddings using the API, though the charges are minimal for a small example like the one we present here. Once you have your key, you can add it to your environment variables as OPENAI_API_KEY.

os.environ["OPENAI_API_KEY"] = 'youropenAIAPIKey'
openai.api_key = os.getenv("OPENAI_API_KEY")

Step 3: Sign up for your free SingleStoreDB trial and add your connection details to your Python environment

We'll go through the example using SingleStoreDB Cloud, but of course you can self-host it and run the example in a similar way. If you're going to use our cloud, sign up for your SingleStoreDB Cloud trial and get $500 in credits.

First, create your workspace using the + icon next to your desired workspace group. S-00 is sufficient for this use case.
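As a hedged sketch of the kind of one-line semantic search query this tutorial builds toward (the reviews table, its columns and the @query_embedding variable are hypothetical stand-ins, not from the original tutorial):

-- hypothetical reviews table; the embedding column holds the OpenAI vector
-- packed into binary form with json_array_pack()
create table reviews(
  id int,
  review_text text,
  embedding blob
);

-- @query_embedding would be produced by the same embeddings model as the data
-- and packed with json_array_pack(); the search itself is a single SELECT
select id, review_text,
       dot_product(embedding, @query_embedding) as similarity
from reviews
order by similarity desc
limit 10;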
Image Matching in SQL With SingleStoreDB
Engineering

Vector functions in SingleStoreDB make it possible to solve AI problems, including face matching, product photo matching, object recognition, text similarity matching and sentiment analysis. In this article, we’ll demonstrate how we use the dot_product function (for cosine similarity) to find a matching image of a celebrity from among 16 million records in just 5 milliseconds! And it's easy – SingleStoreDB does the heavy lifting of parallelization and SIMD-based vector processing for you so you can worry about your application, not your data infrastructure. Other vector functions supported in SingleStoreDB include euclidean distance calculation, transforming JSON arrays to binary vectors, vector math and vector manipulation. Want to see our YouTube video on this topic instead of reading about it? Check it out here.
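As a hedged sketch of what a nearest-match query can look like (the celebrity_images table, its columns and the @probe_vec variable are hypothetical, not from the original post):

-- hypothetical table of image feature vectors
create table celebrity_images(
  id bigint,
  name varchar(100),
  embedding blob   -- feature vector packed with json_array_pack()
);

-- find the single closest image to a probe vector using Euclidean distance;
-- smaller distance means a closer match, so order ascending
select name, euclidean_distance(embedding, @probe_vec) as distance
from celebrity_images
order by distance asc
limit 1;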
Winter 2022 Release: Toys for Developers in SingleStoreDB
Product

In our Spring 2022 blog, we shared how the mechanics one of us used to work with anticipated the arrival of the Snap-on tool truck. The mechanics were like kids waiting for toys from Santa! It's our pleasure to bring you — our application makers and fixers — the latest delivery of the SingleStoreDB tool truck, right in time for the holiday season!

In our latest release (Winter 2022, version 8.0, available on both Cloud and Self-Managed deployments), we're delivering over a dozen new features and tools that developers can use to build powerful real-time applications and analytics. The features span transaction processing, analytics, JSON/semi-structured data, spatial, time series, full text, security, observability and more. This release also enables administrators to deploy and manage SingleStoreDB with more flexibility, security and control than ever before. Read on to learn about our new support for graph and tree data processing in distributed SQL with recursive common table expressions, fast transaction processing on string and JSON data with Universal Storage, enhanced security, improved incremental backup performance, Wasm programmability everywhere and so much more.

Developer Experience

Programmability for analytics on hierarchical data with Recursive CTEs

Common Table Expressions, or CTEs, enable users (devs/DBAs) to simplify and improve the readability and maintainability of complex queries. When it comes to querying hierarchical data, database users often need to perform some SQL 'acrobatics' to get the right results (hierarchical data includes things like org charts or production bills of materials, where products are made of components, components are made up of subcomponents and so on).

While SingleStoreDB has always supported regular (non-recursive) CTEs, the 8.0 release includes support for Recursive CTEs as well. With recursion, a CTE can reference itself. The CTE is executed repeatedly, returning subsets of data until the complete result is returned. This makes it easier to write queries for hierarchical data. Users can now run graph and tree expansions in one SQL statement, instead of writing complex application code (see the short example below, after the feature highlights). Recursive CTEs were one of the most frequently requested features in SingleStoreDB, and we can't wait to see how customers make use of them in conjunction with SingleStoreDB's multi-model capabilities.

Read more: SingleStoreDB Documentation — WITH Recursive CTEs

Laravel Driver

Some of the largest websites in the world are built using the scripting language PHP. PHP has always been great for creating apps with dynamic content that require interaction with databases, and Laravel is a popular framework that simplifies PHP development. However, one hurdle for developers was connecting Laravel with SingleStoreDB. A native driver for this connection did not exist, which meant developers had to use workarounds. One workaround was to use the MySQL connector for Laravel. However, since this was built for MySQL specifically, it does not support all of SingleStoreDB's functionality. Some developers tried to work around this limitation as well by doing additional work, like building tables that were specified as rowstore or columnstore, specifying things like shard and sort keys, and making other adjustments for SingleStoreDB-specific functionality. To remove these hurdles, SingleStore has released a native driver for Laravel. Developers can now work with Laravel and create SingleStoreDB objects (such as tables or indexes) without needing tedious workarounds.
The SingleStore Laravel driver is open sourced and is available on GitHub, where it has seen community contributions, including from Franco Gilio, CTO of Publica.la. Additionally, Fathom Analytics Co-Founder Jack Ellis is deeply involved with the Laravel community and has recently launched the online course SingleStore for Laravel. SingleStore's native driver for Laravel is available for both Cloud and Self-Managed deployments.

Read more: New Integrations for Laravel and Python

Additional Resources
GitHub — SingleStore Laravel Driver
Online course — SingleStore for Laravel (by Jack Ellis, CTO and Co-Founder of Fathom Analytics)

Cognos Connector (Cloud and Self-Managed)

IBM Cognos is an analytics software product that has thousands of customers worldwide. However, to connect Cognos to SingleStoreDB, customers had to rely on MariaDB or MySQL connectors, which were often unreliable, insecure and resulted in connection errors. The problem was significant enough that some customers were willing to pay for third-party drivers (such as the one from CData). SingleStore has partnered with IBM to build a native SingleStoreDB integration, or connector, for IBM Cognos Analytics. The connector is built on the SingleStoreDB JDBC driver. The Cognos connector eliminates these issues and provides a fast, reliable and secure connection between SingleStoreDB and Cognos Analytics to power analytical applications.

Read more: SingleStoreDB Documentation — Connect with IBM Cognos Analytics

Playground

The new SingleStoreDB Playground provides a free, open (no login needed) and easy-to-use "Hello World" experience for users unfamiliar with SingleStoreDB. Users can explore SingleStoreDB without the usual hurdles of learning a new database, finding and loading datasets, or writing SQL. The Playground provides hands-on experience with armchair comfort. Users can run prewritten queries, or can choose to write and run their own queries on familiar datasets. The Playground includes data sets chosen to highlight our product's strengths, including transactions, analytics and multi-model capabilities. These data sets include:

TPC-DS (analytics benchmark)
TPC-H (transactions benchmark)
Real-time digital marketing (application interaction)
JSON data (multi-model)

Already, hundreds of users have run thousands of queries in the Playground, gaining first-hand experience with the performance and speed of SingleStoreDB. Try the Playground here.

'Guided tour' onboarding experience

Previously, users signing into the portal for the first time would need to get started and figure things out by themselves. They couldn't easily experience or appreciate all of the powerful functionality of SingleStoreDB, like streaming data ingestion, unified transactions and analytics, multi-model capabilities and more. Now, when users first create a workspace on the portal, they're led to a guided tour of SingleStoreDB by simply checking a box. The tour takes them through a digital marketing use case complete with data sets, a user workflow, assistance on feature use and a look at the analytics in action. Pit stops of the tour include Pipelines, Rowstore and Universal Storage tables, Query Profiling, analytical queries and finally, a web application that runs on the same data users just created.
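As referenced in the Recursive CTEs section above, here is a minimal sketch of a recursive CTE over hierarchical data (the employees table, its columns and the starting id are hypothetical):

-- hypothetical org-chart table: each row points at its manager
create table employees(
  id int,
  manager_id int,
  name varchar(100)
);

-- walk the whole reporting tree under employee 1 in a single statement
with recursive reports as (
  -- anchor: the root of the tree
  select id, manager_id, name, 1 as depth
  from employees
  where id = 1
  union all
  -- recursive step: everyone who reports to someone already in the result
  select e.id, e.manager_id, e.name, r.depth + 1
  from employees e
  join reports r on e.manager_id = r.id
)
select * from reports;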
Winter 2022 Release: SingleStoreDB Universal Storage, Part 5
Product

SingleStore Universal Storage is a single table type that supports analytical and transactional workloads. Universal Storage is our columnstore format, enhanced for seekability.

JSON and String Seek and Update Performance Improvement

SingleStore has consistently demonstrated that you don’t need to have two different database systems to support transactions and analytics, respectively. We can do both — with a modern, distributed, scalable architecture. With our 8.0 release, we're delivering the fifth installment of Universal Storage, with support for fast seeking into JSON and string data. Fast seeking also helps with update performance, because before you update a row, you have to look it up.

To recap from part four, the main benefits of Universal Storage are:

Improved Total Cost of Ownership (TCO), since not all data needs to fit in RAM
Reduced complexity, because you can just use one kind of table instead of moving data from a rowstore to a columnstore yourself
Improved analytics performance on large tables, since query processing benefits from columnar, vectorized execution

What's New in 8.0 for Universal Storage

In the 7.0, 7.1, 7.3 and 7.5 releases, we evolved our columnstore table type to do things only OLTP-style storage structures were supposed to be able to do. That includes:

Subsegment access (fast seeking into columnstores to retrieve one or a few records when their position is known)
Single-column hash indexes
Single-column unique indexes, constraints and primary keys
Upsert support
Option to set columnstore as the default table type
Multi-column hash indexes
Multi-column uniqueness
Upserts to tables with multi-column unique keys
Columnstore as the default table type for new clusters by default

In 8.0, we now support:

Fast seeks into string data
A new, improved JSON data encoding
Fast seeks into JSON data

We were a little overzealous in part four when we declared that Universal Storage was "done." Now, we're finished — with all but a few final touches. Keep reading to learn more about the orders-of-magnitude performance gains we've achieved for seeking into JSON and string data.

Example: Speeding Up Seeks into a Table of Documents

To show the improved performance of JSON seeking in 8.0, we created a table orders with this schema:

create table orders(
  id int primary key not null,
  order_doc json,
  sort key(id)
);

The order_doc column has a document in it with some top-level properties and a collection of sub-documents representing the lineitems for the order. This is a typical example of using semi-structured JSON data in SingleStoreDB. Here's an example row value:

insert orders values (1, '{"id" : 1, "time" : "2022-11-17 23:03:54",
  "lineitems" : [
    {"line_num" : 1, "SKU" : 688968, "description" : "Apple iPhone 14 Plus", "qty" : 1},
    {"line_num" : 2, "SKU" : 6514052, "description" : "Energizer - MAX AA Batteries (8 Pack)", "qty" : 3},
    {"line_num" : 3, "SKU" : 6457697, "description" : "Star Wars - Galactic Snackin Grogu", "qty" : 1}
  ]}');

The orders table has 8*1024*1024 = 8,388,608 rows in it. We measured the runtime of this query:

select id, order_doc into _id, _od from orders where id = _oid;

This query retrieves all the fields of one row, specified by a random constant _oid.
Running on an 8-core MacBook with an x86_64 Intel chip (for both the old and new JSON columnstore data formats) using SingleStoreDB 8.0.2, we observed these runtimes:

JSON Format                                Avg Runtime (s)
Old (8.0 with use_seekable_json = OFF)     0.5410
New (8.0 with use_seekable_json = ON)      0.001277
Speedup                                    423x

Yes, that's right — a 423x speedup! The speedup comes from not having to scan an entire million-row segment of data. Rather, once we know the row position within the segment, we simply seek into the column segments directly to get the bytes for the row. Since there's an index on id, we use that to identify the row position fast.

See Appendix 1 for how to generate random orders data and try these queries yourself! You can also do this with a free SingleStoreDB trial.

New JSON Data Organization

If you want to know how SingleStoreDB now stores JSON data in Universal Storage tables under the hood, read on. Columnstore table columns containing JSON data are organized differently in 8.0 to enable fast row lookups. Prior to 8.0, we kept the data in a Parquet format internally. Over successive releases, we've made it possible to seek into number, date and string data using our own columnstore encodings (compression formats). But data in Parquet is stored differently, in a way that makes it difficult to seek into it — and to perform other kinds of fast query execution on it.

In 8.0, we use a new SingleStore-encoded Parquet (S2EP) data format. It stores data such that:

Top-level JSON properties are stored like regular columns, with each property stored column-wise, using the same encodings that we use for other data types, including numbers, strings and dates.
Nested properties (inside sub-documents) are also stored like regular columns. For example, if a property x.y.z exists, via navigation from a top-level document x to a sub-document y to a property z, it will be stored in a sequential columnar format using our standard encodings.
Arrays are pivoted and stored in columnar format. An array of numbers that is a property of an enclosing document would be stored as a logical column, with multiple values in the column belonging to the same row. Array length information is kept to enable SingleStoreDB to identify which values belong to which row.
Very sparse columns are stored in a catch-all area, and not stored in encoded columnar format. By sparse, we mean they don't occur in very many documents, i.e., in fewer than 1 of 10,000 documents.

Because we use the same encodings we use for regular data, we are now able to seek into that data using the same approach we used in earlier releases for standard columns. By using our standard encodings, we are also laying the groundwork for future performance improvements, like operations on encoded data applied directly to JSON property data. Because of this new feature, there will be no need to even consider promoting JSON properties to persisted computed columns moving forward — as is sometimes done today to improve performance. It works well, but of course it comes with additional complexity.

Improved Transaction Processing Speed for String Data

Universal Storage tables are broken into million-row segments. Given a (segment_id, row_number) pair, we can seek to find the row within the segment, rather than scan the segment from beginning to end to find the data for the given row_number. This capability first arrived in SingleStore 7.0 in 2019. It's illustrated in Figure 1.
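To make the seek example easier to picture, here is a hedged sketch of looking up one order and pulling fields out of its JSON document. The :: and ::$ accessors are SingleStoreDB's JSON extraction syntax; the property names come from the sample row above, and the exact output shapes are not guaranteed here.

-- seek one row by its primary key, then extract JSON properties;
-- with the 8.0 seekable JSON encoding, this only touches the needed column segments
select
  id,
  order_doc::$time as order_time,     -- top-level property, extracted as text
  order_doc::lineitems as lineitems   -- nested array, returned as JSON
from orders
where id = 1;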
[r]evolution Summer 2022: Bring Application Logic to Your Data With SingleStoreDB Code Engine for Wasm
Engineering

SingleStoreDB Cloud now supports user-defined functions written in C, C++ and Rust with our new Code Engine — Powered by Wasm. Application developers can now use libraries of existing code, or code from their applications, in database functions and call them from SQL. That can tremendously reduce the need to move data from the database to the application to analyze it, speeding things up and simplifying development. We illustrate the power of Code Engine for Wasm with a story about a developer who finds a new way to create an important sentiment analysis report — easier and faster.

You're an application developer for an eCommerce company, and you've been asked several times to produce reports that do sentiment analysis on the review comments for products your company has sold. It is getting mighty tedious because your sentiment analysis code lives in a Rust program, so every report requires you to export data to the app and score it there. Then, you have to write query processing-style logic in your app to create the report. It's honestly kind of fun, but you — and your boss — think your time would be better spent elsewhere.

That's when you learn from your DBA that your database, SingleStoreDB, has a new feature called Code Engine — Powered by Wasm that lets you extend the database engine with user-defined functions (UDFs) in C, C++ or Rust, and other languages soon. It dawns on you that you can take the sentiment analysis code from your application and move it into the database as a Wasm UDF. That means those reports people have been asking for can be created using a pretty straightforward SQL SELECT statement. Better yet, you can teach the analysts to write the queries themselves. Then, you don't even need to hear about it! We'll hear more of this story later!
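As a hedged illustration of where the story is headed: once the Rust sentiment code is registered through Code Engine, the report collapses into ordinary SQL. The product_reviews table and the sentiment_score UDF name below are hypothetical stand-ins, not from the original post.

-- sentiment_score() stands in for the Rust code loaded as a Wasm UDF;
-- the report becomes a plain aggregate query, with no data export step
select product_id,
       avg(sentiment_score(review_text)) as avg_sentiment,
       count(*) as num_reviews
from product_reviews
group by product_id
order by avg_sentiment asc;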
Toys for Developers in SingleStoreDB: Spring 2022 Release
Product

I used to pump gas at a Shell station in Oregon, where they still fill your tank for you. I patched the occasional tire and changed some belts and hoses too, taught by two great mechanics who could fix anything. Those guys were pros — tough and skillful men. But there was one thing that always turned them into little kids: the arrival of the Snap-on tool truck — the mechanics' toy store on wheels!

You database application developers are makers and fixers too. You're grown up. But what grown-up builder doesn't like their toys? We're proud to announce the arrival of the SingleStoreDB tool truck for Spring 2022!
Flexible Parallelism in SingleStoreDB
Product

Veteran SingleStoreDB users take our elasticity for granted. SingleStoreDB is a true distributed database. It's easy to scale up, so you can handle increasing demands to process queries concurrently — for both transactions and analytics. Many application developers are still using single-box database systems and sharding data manually at the application level. That's not necessary with SingleStoreDB, which can save you many person-years of development time — letting you focus on the app, and not the data platform.

Although we can stretch and grow with your workload, we're pushing to make our elastic scale easier and more powerful. With the latest release of SingleStoreDB, we've taken a big step toward our ultimate goal of transparent elasticity with our flexible parallelism (FP) feature. FP is the premier feature in this release, and is part of our long-term strategy to make SingleStoreDB the ultimate elastic relational database for modern, data-intensive applications.

Flexible parallelism

SingleStoreDB uses a partitioned storage model. Every database has a fixed number of partitions, defined when you create the database. Historically, SingleStoreDB has done parallel query processing with one thread per partition. This works great when you have an appropriate partitioning level compared to the number of cores, such as one core for every one or two partitions. With prior releases, when you scaled up, you couldn't use the extra cores to run one query faster. You could only benefit from them by running more queries concurrently. FP changes that.

Flexible parallelism works by internally adding a sub-partition id to each columnstore table row. Then, query execution decides on the fly which set of sub-partitions each thread scans. The use of sub-partitions still allows shard-key-matching queries to work. To make this efficient, the sub-partition id is added internally to the beginning of the sort key.
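A minimal sketch of the partitioning concepts involved (the database name, partition count and table are arbitrary; flexible parallelism itself is internal and needs no schema changes):

-- the partition count is fixed when the database is created
create database app_db partitions 8;

use app_db;

-- columnstore tables are split across those partitions; with flexible
-- parallelism, a single query can be scanned by more threads than there
-- are partitions, because each thread picks up sub-partitions on the fly
create table events(
  id bigint,
  ts datetime,
  payload json,
  shard key(id),
  sort key(ts)
);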
Table Range Partitioning Is a Crutch. Here’s Why SingleStore Doesn’t Need It
Engineering

While antiquated legacy databases rely on table range partitioning to manage data lifecycles, modern databases built for data-intensive applications — like SingleStore — have functionality that eliminates the need for unnecessary, outdated processes (like range partitioning).

Table range partitioning is a crutch that legacy databases use to help them handle the lifecycle of bulk data additions and removals. SingleStore doesn't need range partitioning, mainly because it can delete data so fast. Using bulk loading (SingleStore's LOAD DATA or PIPELINES) plus SQL INSERT and DELETE to manage the lifecycle of data is much easier than using range-partitioned tables and partition-switching operations. SingleStore's speed at adding and removing data makes it a joy to work with, compared to partitioning operations in legacy database systems.

You can stop reading now. But seriously, you may be thinking that you'd like to see some of the reasoning behind this, and some proof that we're as fast as I'm saying. Here goes.

Range partitioning exists for these reasons in legacy systems:

1. It allows you to swap out a range of data fast, from an existing table to a target table, with a metadata-only operation. Then you can truncate the resulting target table to remove the data. Truncating in the legacy systems is the only way to quickly delete data.
2. It allows you to swap in a range of data fast, from a staging table to a target table. This allows you to prepare data in a staging table before putting it into a table where it will be queried.
3. It provides a coarse form of indexing, since data is in strictly ascending order from earlier to later ranges. That means you can partition three years of data into 36 one-month ranges.
4. Different partitions can be stored on different storage devices, allowing you to add new devices as your data grows to hold the increasing volume of data.

Let's look at why SingleStore can address the above requirements without range partitioning — and why it's much easier to use SingleStore to manage the data lifecycle than it is with legacy range partitioning.

First, swapping out a range of data fast: SingleStore does not need to be able to swap out a range of data fast, because it can delete data phenomenally fast. Stay tuned for proof of that below.

Second, swapping in a range of data fast: there's no need to swap data into a SingleStore table, because we can load millions of records per second using INSERT operations (say, several threads inserting data in 100-row batches) or SingleStore PIPELINES.

Third, a coarse form of indexing: SingleStore doesn't need partitioning to coarsely index data, because columnstore sort keys can be used to order the data by what you would have used as the partitioning key in a legacy system (e.g., the date_key or a datetime column).

Finally, storing different partitions on different devices: SingleStore allows you to add more storage via a partitioned (sharded) storage model, where you can add more nodes to a cluster (of course, these nodes have their own additional storage) and rebalance your data. So there's no need to be concerned about keeping specific ranges of data on specific devices.

SingleStore DELETE Speed

A common bulk operation on large tables related to lifecycle management is removing old data. Here's an example of how fast SingleStore can remove old data.
I created the following table on a SingleStore S8-size cluster on our managed service:

Table: lineitem2
Columns: 17
Data size: 1,254,492,160 rows

An S8 has 8 units, which is 64 total cores and 512 GB of RAM.

Then, I deleted the oldest 116,325,376 rows (about 9.3% of the data). This is similar to deleting the oldest month from a table with a year of data. This took 0.2081 seconds. Yes, you read that right — 0.2081 seconds. Not minutes, not hours, not days. With delete speeds like this, who needs the complexity of partitioning? Not SingleStore users.

Summary

If you think you need table partitioning and you're using SingleStore, ask yourself why. If it's to speed up bulk removal of old data, we think you'll be happy if you just use DELETE instead, because it's fast and easy. If it's to speed up query processing by keeping the data in order, you can just use a SORT key. There's really no need for range partitioning in SingleStore. Want to test this out for yourself? Get started with your free trial of SingleStore today — and say goodbye to table range partitioning for good.

Appendix: Script for Delete Test

If you want to reproduce this yourself, you can run this script on a size S8 cluster. Or choose a different size cluster, modify the script, and experiment with it.

RESTORE DATABASE memsql_demo FROM s3 "s3://memsql-demo/" CONFIG '{"region":"us-west-2"}';

use memsql_demo;

-- verify size of lineitem is 1,225,090 rows
select format(count(*),0) from lineitem;

-- create a columnstore version of the lineitem table, calling it lineitem2
CREATE TABLE `lineitem2` (
  `orderkey` bigint(20) NOT NULL DEFAULT '0',
  `partkey` int(11) DEFAULT NULL,
  `suppkey` int(11) DEFAULT NULL,
  `linenumber` int(11) NOT NULL DEFAULT '0',
  `quantity` decimal(20,2) DEFAULT NULL,
  `extendedprice` decimal(20,2) DEFAULT NULL,
  `discount` decimal(3,2) DEFAULT NULL,
  `tax` decimal(3,2) DEFAULT NULL,
  `returnflag` char(1) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `linestatus` char(1) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `shipdate` date DEFAULT NULL,
  `commitdate` date DEFAULT NULL,
  `receiptdate` date DEFAULT NULL,
  `shipinstruct` varchar(25) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `shipmode` varchar(10) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `comment` varchar(44) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `created` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
  SHARD KEY (`orderkey`,`linenumber`),
  SORT KEY(created)
);

-- seed some data from lineitem into lineitem2
insert into lineitem2 select * from lineitem;

-- append lineitem2 to itself until it has more than a billion rows
delimiter //
do
declare c bigint;
begin
  select count(*) into c from lineitem2;
  while c < 1000*1000*1000 loop
    insert into lineitem2 select * from lineitem2;
    select count(*) into c from lineitem2;
  end loop;
  echo select format(count(*),0) from lineitem2;
end //
delimiter ;

-- verify row count
select format(count(*),0) from lineitem2;

-- flush the table to make it all columnstore format
-- (not strictly necessary, but for demonstration purposes it makes sure
-- we are only observing columnstore delete speed)
optimize table lineitem2 flush;

-- There are about 40 different "created" times. This query shows
-- total rows for each, and a running total, ordered by "created"
with t as (
  select created, count(*) c
  from lineitem2
  group by created
)
select row_number() over (order by created), t.created, c,
  sum(c) over (order by created) as cum
from t
order by created;

-- Now, copy out the timestamp for about 116 million rows,
-- which is: 2021-08-23 18:37:33
-- Use this in the DELETE statement to remove around 9.3% of the rows.

-- Run an equivalent delete command to what we want to measure,
-- so the plan gets compiled, and thus we won't be measuring compile time.
-- The time chosen is before all the data, so this doesn't delete any rows.
delete from lineitem2 where created <= "2020-08-23 18:37:33";

-- run the DELETE, get the before and after times, subtract, and scale
-- to show total time in seconds
select now(6) into @ts1;
delete from lineitem2 where created <= "2021-08-23 18:37:33";
select now(6) into @ts2;
select timestampdiff(microsecond, @ts1, @ts2)/1000000.0 as secs;
SingleStore Adds Mission-Critical Capabilities for Enterprise Applications
Product

You won’t believe the technology that goes into making SingleStore the fastest modern cloud database. And our latest features bring mission-critical capabilities to developers of all kinds, from enterprise database engineers to SaaS application developers. In this post, Micah Bhakti and Eric Hanson from SingleStore’s Product Management team reveal what’s new.

SingleStore’s Mission-Critical Capabilities for Enterprise Applications
Eric Hanson / Micah Bhakti

SingleStore’s newest capabilities allow us to take on new mission-critical workloads. As a developer, you will have the control you need to feel confident running your mission-critical, data-intensive applications on SingleStore. And SingleStore can be deployed anywhere you run your applications, on AWS, Azure and GCP, or self-hosted on your own infrastructure, with unmatched performance, scalability and durability.

SingleStore’s unique patented Universal Storage delivers the separation of storage and compute, and has enabled the following features in a new product edition, SingleStore Premium, which complements the existing SingleStore Standard. SingleStore Premium is designed for mission-critical applications that have stringent requirements for availability, auditability and recovery.

Key mission-critical features included in Premium:

Point-In-Time Recovery
Multi-AZ Failover
Resource Governance
Audit Logging
HIPAA Compliance
99.99% Availability
Silver Support

These capabilities allow enterprises running mission-critical internal and customer-facing applications with the most stringent requirements to guarantee the availability, durability and auditability of all of their information, while delivering the performance and scalability SingleStore is known for. This article explains these and other performance and usability improvements now available in SingleStore.

Point-in-Time Recovery

For years, SingleStore has supported transactions via log-based recovery, multi-version concurrency control, lock-based write/write synchronization, built-in high availability and disaster recovery. But customers and prospects have asked for one feature in particular to help them recover to a consistent database state in the event of data corruption: point-in-time recovery, or PITR.

PITR is now generally available after previously debuting as a preview. It allows you to recover a database to a transaction-consistent state at any point in time, down to datetime(6) resolution, or to any named milestone you created in the past. For example, suppose you deployed an application change at 2:00 am and discovered at 2:30 am that it corrupted the database. You can detach the database, then reattach it as of 2:00 am, to get the data back to the consistent state it was in before you deployed the application change.

Moreover, even though SingleStore is a distributed database, with separate transaction logs for each partition, PITR uses a new technology called global versioning that allows the system to bring the database to a transaction-consistent state after recovery.

Point-in-time recovery works by:

1. recovering the latest snapshots taken just before the desired point in time, which contain rowstore data and metadata that references columnstore blob files, then
2. playing back the log files to the desired point in time or milestone.

This is illustrated in the following diagram:
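A hedged sketch of the recovery flow described above. The database name, timestamp and milestone name are hypothetical, and the statement forms are approximate, so check the PITR documentation for the exact syntax.

-- optionally record a named milestone before a risky deployment
CREATE MILESTONE "before_app_change" FOR my_db;

-- after discovering the 2:30 am corruption: detach, then reattach
-- the database as of a point in time (or a named milestone)
DETACH DATABASE my_db;
ATTACH DATABASE my_db AT TIME '2023-01-15 02:00:00';
-- or: ATTACH DATABASE my_db AT MILESTONE "before_app_change";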
SingleStore’s Patented Universal Storage - Part 4
Product

SingleStore Universal Storage is a single kind of table that can support analytical and transactional workloads. For decades, many believed that you needed special databases and technology for analytical vs. transactional workloads. SingleStore has proved that wrong. With our 7.5 release, we're delivering the fourth installment of Universal Storage technology, with support for multi-column keys, and making columnstore the default table type for new installations. We're proud to say all the major features of Universal Storage are done.

This is the last in a series of four articles that describe SingleStore's unique, patented Universal Storage feature. Read Part 1, Part 2 and Part 3 in the series to learn the whole story.

The main benefits of Universal Storage are:

(1) Improved Total Cost of Ownership (TCO), because data doesn't need to all fit in RAM for typical operations that involve upserts or need unique key enforcement. That reduces cost by not requiring servers with very large RAM, which can be expensive.
(2) Reduced complexity, because now you don't have to do some operations on a rowstore and then move data to a columnstore, delivering speed and performance not possible before.
(3) Improved analytics performance on large tables when combined with upserts, unique key enforcement, fast lookups and other OLTP-style operations. That's because analytical queries can process hundreds of millions of rows per second on columnstore tables on just a single core, whereas peak performance per core on rowstores is about 20 million rows per second.

What's New in 7.5 for Universal Storage

In the 7.0, 7.1 and 7.3 releases, we evolved our columnstore table type to do things only OLTP-style storage structures were supposed to be able to do. That includes:

Subsegment access (fast seeking into columnstores* to retrieve one or a few records when their position is known)
Single-column hash indexes
Single-column unique indexes, constraints and primary keys
Upsert support
Option to set columnstore as the default table type

*This new columnstore table type is what we call Universal Storage. But you'll still see it called "columnstore" in our syntax.

In 7.5, we now support:

Multi-column hash indexes
Multi-column uniqueness
Upserts to tables with multi-column unique keys
Columnstore as the default table type for new clusters by default

From a functional perspective, Universal Storage is done now. We don't expect to add more surface area to it. The performance is great, but we can make it even better, so stay tuned for advances on that front in future releases.

Examples

The examples covered in this article are:

loading 28 million rows of data into the lineitem table for TPC-H
moving that data into a table with a two-column unique key
showing that this unique key is enforced
seeking on this unique key and showing how fast that is

The lineitem table is realistic, similar to one you might find supporting ecommerce applications doing transactions, analytics, or both. Uniqueness enforcement is a common requirement for all kinds of applications. Having the database automatically enforce uniqueness is of course beneficial, since it automatically ensures data integrity and frees the application developer from enforcing uniqueness themselves with application code.

Loading into the lineitem table

SingleStore Studio and the S2MS management console both allow you to easily load the TPC-H data set at scale factor 100.
Here, I just load part of the data, stopping after I get to about 28 million rows of lineitem data, to make it a bit quicker to get started. I actually just copied part of the example that loads the TPC-H data from Studio. To repeat this test yourself, run the following commands:

DROP DATABASE IF EXISTS tpch;
CREATE DATABASE tpch;
USE tpch;

CREATE TABLE `lineitem` (
  `l_orderkey` bigint(11) NOT NULL,
  `l_partkey` int(11) NOT NULL,
  `l_suppkey` int(11) NOT NULL,
  `l_linenumber` int(11) NOT NULL,
  `l_quantity` decimal(15,2) NOT NULL,
  `l_extendedprice` decimal(15,2) NOT NULL,
  `l_discount` decimal(15,2) NOT NULL,
  `l_tax` decimal(15,2) NOT NULL,
  `l_returnflag` char(1) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `l_linestatus` char(1) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `l_shipdate` date NOT NULL,
  `l_commitdate` date NOT NULL,
  `l_receiptdate` date NOT NULL,
  `l_shipinstruct` char(25) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `l_shipmode` char(10) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `l_comment` varchar(44) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  SHARD KEY (`l_orderkey`) USING CLUSTERED COLUMNSTORE
);

CREATE OR REPLACE PIPELINE tpch_100_lineitem
AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/lineitem/'
  config '{"region":"us-east-1"}'
  SKIP DUPLICATE KEY ERRORS
  INTO TABLE lineitem
  FIELDS TERMINATED BY '|'
  LINES TERMINATED BY '|\n';

START ALL PIPELINES;

Now, you can run this query in another session to see the progress of the pipeline:

SELECT
  CONCAT(PIPELINE_NAME) AS pipelineId,
  sub.BATCH_STATE AS lastBatchState,
  IFNULL(sub.BATCH_ROWS_WRITTEN, 0) AS lastBatchRowsWritten
FROM (
  SELECT DATABASE_NAME, PIPELINE_NAME, BATCH_STATE, BATCH_ROWS_WRITTEN,
    ROW_NUMBER() OVER (PARTITION BY DATABASE_NAME, PIPELINE_NAME) AS r
  FROM INFORMATION_SCHEMA.PIPELINES_BATCHES_METADATA
  WHERE BATCH_STATE NOT IN ('No Data', 'In Progress')
) sub
WHERE r = 1 AND DATABASE_NAME = 'tpch'
ORDER BY pipelineId ASC;

When enough data has loaded (say, about 28 million rows), you can stop the pipeline so it won't take as long or use as much space (there are about 600 million rows total available):

stop all pipelines;

Now check the size of the table:

select format(count(*), 0) from lineitem;

Create lineitem_uk table with a two-column unique key

Now, we'll create a different version of the table with a primary key:

set global default_table_type = 'columnstore';

create table `lineitem_uk` (
  `l_orderkey` bigint(11) NOT NULL,
  `l_partkey` int(11) NOT NULL,
  `l_suppkey` int(11) NOT NULL,
  `l_linenumber` int(11) NOT NULL,
  `l_quantity` decimal(15,2) NOT NULL,
  `l_extendedprice` decimal(15,2) NOT NULL,
  `l_discount` decimal(15,2) NOT NULL,
  `l_tax` decimal(15,2) NOT NULL,
  `l_returnflag` char(1) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `l_linestatus` char(1) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `l_shipdate` date NOT NULL,
  `l_commitdate` date NOT NULL,
  `l_receiptdate` date NOT NULL,
  `l_shipinstruct` char(25) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `l_shipmode` char(10) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `l_comment` varchar(44) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  shard key (l_orderkey),
  sort key (l_shipdate),
  primary key (l_orderkey, l_linenumber)
);

Now, run this to move the data over to lineitem_uk:

insert into lineitem_uk select * from lineitem;

Verifying that Uniqueness is Enforced

First, let's verify that uniqueness is really being enforced.
If you run this:

insert into lineitem_uk select * from lineitem limit 1;

You'll get this error:

ERROR 1062 ER_DUP_ENTRY: Leaf Error (172.17.0.2:3308): Duplicate entry '19805284-1' for key 'PRIMARY'

That's what we'd expect, since we know we already put that row into lineitem_uk once before.

Profiling Performance of a Multi-column Unique Key Lookup

Now, let's profile a seek query that looks up one row via a key. (If you stopped the loading early, you may need to verify that a row with l_orderkey = 364000000 is present; if it's not there, pick another l_orderkey value that is, and modify the query below to use that value instead.) Every order has an l_linenumber 1 value, so that's always a safe choice. Now, run the single-row lookup:

select * from lineitem_uk where l_orderkey = 364000000 and l_linenumber = 1;

For me, the profile took about 2 milliseconds to complete the ColumnStoreScan operator, which actually does a seek using the primary key index on (l_orderkey, l_linenumber). I'm using a new MacBook with 8 cores. You can profile in Studio by clicking on the "..." on the upper right of the SQL pane and selecting the profile option. If you mouse over the ColumnStoreScan operator, it will show how long it took. You'll have to run it twice to get accurate results.

A seek time of 2 milliseconds into a moderately wide columnstore table like this one is quite fast, as expected (I showed that in one of my earlier Universal Storage blogs). This is OLTP-level speed on a columnstore.

Upsert on a Multi-column Key Columnstore

One of the most common issues customers have had in the past when ingesting data into SingleStore is that they have a streaming upsert workload, and they had to first load the data into a rowstore to accomplish that. They'd then migrate it to a columnstore after it stabilized, say after a few days, for long-term storage, to save money on RAM, since rowstores must be in RAM. Queries would have to be modified to retrieve data from the columnstore and the rowstore and combine the information. This was extra work for developers.

Many of the upsert operations in these kinds of workloads depend on a multi-column unique key. Until SingleStore 7.5, you could not upsert to Universal Storage (columnstore) directly if you had a multi-column unique key. We're happy to say that we can now handle multi-column upsert workloads directly into a columnstore table. That means you no longer need to migrate data from a rowstore to a columnstore, or perform a potentially non-trivial combination of data from two tables, a "hot" rowstore and a "long-term archival" columnstore.

To help understand the examples that follow, a good reference on the upsert-style operations supported in SingleStore is here. The most basic kind of conditional insert is not really an upsert at all, but rather INSERT IGNORE. As an example of this, suppose we have a staging table with two rows in it, created like so:

create table upsert_records as select * from lineitem_uk limit 2;

When I ran this, I got records with these keys:
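A hedged sketch (not from the original post) of what a conditional insert and an upsert against the two-column key on lineitem_uk can look like, using upsert_records as the source; the chosen update column is illustrative only:

-- INSERT IGNORE: rows whose (l_orderkey, l_linenumber) key already exists
-- are silently skipped instead of raising ER_DUP_ENTRY
insert ignore into lineitem_uk select * from upsert_records;

-- a true upsert: update the quantity when the key already exists
insert into lineitem_uk select * from upsert_records
on duplicate key update l_quantity = values(l_quantity);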
Why You Should Use a Scale-out SQL DBMS Even When You Don't Need One
Data Intensity

Everybody has heard of the KISS principle -- Keep It Simple, Stupid. When applied to data management technology, it implies that you should use a simple solution when it works. Developers often believe that means using a single-node DBMS like MySQL, PostgreSQL or SQL Server.

So, what is a "single-node" database? Essentially, it's a database designed for a single machine. The capacity of a single machine dictates the resource constraints of processing power, connections and storage. If you need more processing power or storage, you can vertically scale, meaning you upgrade to a more powerful machine. This can work, up to the scale limits of the largest available machine. By contrast, scale-out databases are built as distributed databases from the start, that is, designed to be run across machines.

Sometimes people see scale-out relational database technology, like SingleStore, as strictly for high-end applications, beyond what a single-node DBMS can handle. The thinking goes that they are inherently more complex to use than a single-node DBMS. I'm here to tell you that the KISS principle is right. But people who think it means you should only use scale-out for "extreme" apps are dead wrong. Let's look at the arguments for why scale-out is not "simple," and address them one by one. We'll see that in many cases, scale-out actually makes things simpler.

It's hard to get and set up and manage the hardware and software.

First, database platform-as-a-service (PaaS) offerings like SingleStoreDB Cloud handle all that for you. You can choose a size and dial up and down whenever you need to. For self-hosting, using the public cloud or a well-organized enterprise data center with a range of hardware SKUs means you can pick out and provision machines fairly easily.

My single-node database is fast enough.

For some problems with small data, a single-node DBMS may be fast enough. But there are thousands of applications out there with medium-to-large data on a single-node system. The people who built them may think they are fast enough, but what if they could run queries instantaneously? Research shows that response time under ¼ second feels instantaneous, and that generates incredible user satisfaction. This drives users to explore more freely, learning more about the data, which helps them make better decisions. Your single-node DBMS might be able to provide near-instant responses for a few users, but what if there are many users? Enterprises are pursuing digital transformation initiatives to get more out of the data they have, not just to scale to handle bigger data sets. The levels of speed and concurrency you can get from scale-out enable new applications that unlock data's value, enabling digital transformation.

If my problem gets big, I can just add more nodes of my regular RDBMS.

A common pattern for scaling applications is to create multiple databases on multiple nodes using a single-node DBMS. This could be done in a number of ways, such as (a) putting each customer's data in a different database with the same schema, and spreading those databases across multiple nodes, or (b) creating replicas of the same database to scale out the workload. Either way, this is anything but simple, because you have to decide at the application level how to split up your data. Moreover, there might be a situation where you need more than one node to handle the workload for a single database. At that point, your single-node database runs out of steam.

My problem is not big enough to benefit from scale out.
You'd be surprised about how small an application can be and still benefit from scale out. Here are a couple of examples. First, SingleStore has a customer with a few billion rows of data in a data mart that would easily fit on a single-node database. But they are delivering quarter-second response time for dashboard refreshes to several hundred concurrent users, with intra-day updates, enabling a complete digital transformation in how the data is consumed. It's sort of a real-time data mart. As a second example, we have customers that have less than a million rows of data but are using brute-force searches of vector data for AI image-matching applications, using DOT_PRODUCT and EUCLIDEAN_DISTANCE functions in SQL. This brute-force approach gives them better match fidelity than multi-dimensional vector indexing (which isn't available in SQL DBMSs to date anyway), and still lets them integrate their match queries with other SQL query constructs. And see the later discussion about using brute-force scale out to simplify away the need for complex solutions like pre-calculated aggregate tables. Plenty of people with only a few hundred thousand rows of data can benefit from that.

It's hard to design my scale-out database to get it to perform. Yes, there are a couple of new concepts to learn with a scale-out database, mainly sharding (for partitioning data across nodes) and reference tables (for duplicating small dimension tables onto each node). But the horsepower you get from a good, high-performance scale-out database actually simplifies things. For example, you may need materialized views or pre-calculated summary aggregate tables with a single-node database but not with a scale-out database. Pre-calculated aggregates and materialized views are tricky to use and introduce design problems that are conceptually harder than deciding how to shard your data. If you know when to use an index, you can learn how to shard your data and use reference tables in a few minutes. And you don't have to get it right the first time; it's easy to reconfigure (create a new table with the configuration you want, INSERT...SELECT... the data into it, drop the old one, rename the new one, and you're done; see the sketch at the end of this section).

I can't find people with the skills to use a scale-out DBMS. Modern SQL-based scale-out databases are based on standard SQL, and are often compatible with MySQL or Postgres. SingleStore is largely compatible with MySQL, for example. So hundreds of languages and tools can connect to these SQL-based scale-out databases. And almost all of the SQL skills people have from working with SQL databases like MySQL and Postgres are transferable.

I want to use a general-purpose database and there are no general-purpose databases with scale out that work for me. A general-purpose tool simplifies life, that's true, by making skill sets applicable to multiple problems, and reducing the effort to search for the right tool. Fortunately, there is a general-purpose SQL database that scales out and runs anywhere. I think you know what database that is.

I could go on, but you get the idea -- the spartan simplicity of relational databases and SQL carries its benefits over to scale-out SQL systems. And scale out simplifies lots of things, and enables digital transformation opportunities that can be valuable to your business and your career. There's a corollary to the KISS principle that applies here also, often attributed to Albert Einstein: "Everything should be made as simple as possible, but no simpler."
In this context, that means you shouldn't give up, due to perceived complexity, on really valuable application innovations made possible by the performance you can get from scale out. Finally, scale out exists to improve speed and scalability, but it isn't the only lever. SingleStore scales out, and it also has other important technologies to make things faster, including in-memory rowstore tables, columnstores, compilation of queries to machine code, vectorization, and a high-performance plan cache. All of these things squeeze the most out of each processor core in your system. So, next time you are about to reach for your trusty single-node DBMS, ask yourself: can I reach higher, do more, and keep it simple at the same time?
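As a footnote to the resharding point above, here is a minimal sketch of the create-copy-swap approach (the table and column names are hypothetical, and the exact rename syntax may vary by version):

-- new table with the shard key you actually want
CREATE TABLE events_resharded (
  id BIGINT,
  customer_id BIGINT,
  amount DECIMAL(18,2),
  SHARD KEY (customer_id)
);

-- copy the data across the cluster in one parallel statement
INSERT INTO events_resharded SELECT * FROM events;

-- swap the tables
DROP TABLE events;
ALTER TABLE events_resharded RENAME TO events;  -- rename syntax may vary by version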
SingleStore’s Patented Universal Storage - Part 3
Product

SingleStore’s Patented Universal Storage - Part 3

SingleStore Universal Storage technology is a dramatic evolution of our columnstore data format that allows it to support both operational and analytical workloads with low total cost of ownership (TCO). People love our rowstores for OLTP-style access patterns because of our great performance based on memory-optimized data structures and compilation of queries to machine code. But for users with large data sets, the cost of providing servers with enough RAM to hold all the data started to be significant. Universal Storage solves this cost problem, while also providing even better analytical performance via query execution that takes advantage of columnar storage formats, vectorized execution, and single-instruction multiple-data (SIMD) instructions.

This is the third in a series of four articles that describe SingleStore's unique, patented Universal Storage feature. Read Part 1, Part 2 and Part 4 in the series to learn the whole story.

Our 7.0 release introduced Universal Storage and the 7.1 release enhanced it. Now, we're delivering Episode 3 of Universal Storage in our 7.3 release. In this installment, we've added:

- Columnstore can be made the default table type by setting a new engine variable, default_table_type, to 'columnstore'
- UPSERT support for columnstore tables with single-column unique keys, including INSERT ON DUPLICATE KEY UPDATE, INSERT IGNORE, and REPLACE
- Analogous capabilities for Pipelines and LOAD DATA
- Support for unique key enforcement for very large INSERTs and UPDATEs

Now, arbitrary INSERTs and UPDATEs, regardless of the number of rows updated, will succeed. In 7.1, if more than, say, 20-30 million rows were updated in a single statement on a small node, the operation could fail due to an out-of-memory condition. Please see this video to learn more about our Universal Storage update in SingleStore 7.3.
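To make that concrete, here is a minimal sketch of the 7.3 behavior described above: setting the new engine variable, then running an upsert against a columnstore table with a single-column unique key. The table and column names are made up for illustration, and you should verify the variable-setting statement against the documentation for your deployment:

-- make columnstore the default table type
SET GLOBAL default_table_type = 'columnstore';

-- columnstore table with a single-column unique hash key,
-- sharded on the unique key so uniqueness can be checked locally on one leaf
CREATE TABLE kv (
  k INT,
  v INT,
  SHARD KEY (k),
  UNIQUE KEY (k) USING HASH,
  KEY (k) USING CLUSTERED COLUMNSTORE
);

INSERT INTO kv VALUES (1, 10);

-- upsert: the second insert updates v instead of failing on the duplicate key
INSERT INTO kv VALUES (1, 20) ON DUPLICATE KEY UPDATE v = VALUES(v);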
SingleStore’s Patented Universal Storage - Part 2
Product

SingleStore’s Patented Universal Storage - Part 2

Universal Storage, first introduced in the SingleStoreDB Self-Managed 7.0 release, advances quickly in SingleStoreDB Self-Managed 7.1. Universal Storage allows SingleStore to support the world’s largest workloads – as well as many smaller workloads – with extremely high performance and excellent price-performance. In this release, new Universal Storage features make it even easier to support blended analytical and transactional workloads that achieve near-rowstore performance for selective queries, but at columnstore costs.

This is the second in a series of four articles that describe SingleStore's unique, patented Universal Storage feature. Read Part 1, Part 3 and Part 4 in the series to learn the whole story.

SingleStore Universal Storage appeared in SingleStoreDB Self-Managed 7.0, with notable improvements in performance and total cost of ownership (TCO) for both rowstore and columnstore workloads. In SingleStoreDB Self-Managed 7.1, we make further improvements to rowstore-like aspects of performance on the columnstore side. Support for high-performance hybrid transactional and analytical processing (HTAP) workloads is becoming increasingly important in the database industry. Industry analysts refer to this kind of workload as “translytical,” “operational analytics,” “augmented transactions,” or “analytic/augmented transaction processing (ATP).” We’ll use the term HTAP here. SingleStore leads the industry for HTAP performance, outdistancing legacy products by a factor of ten or more in price-performance. But we’re not satisfied. We want to support the world’s largest workloads cost-effectively — more cost-effectively than anyone, even SingleStore, has achieved to date. SingleStore shipped the first installment of our groundbreaking Universal Storage technology in SingleStoreDB Self-Managed 7.0 last December. The purpose of Universal Storage is to dramatically reduce total cost of ownership (TCO) for HTAP workloads. Moreover, Universal Storage technology can allow much larger online transaction processing (OLTP) workloads to be run on SingleStore, with much lower TCO, than ever before. In the SingleStoreDB Self-Managed 7.1 release, we’ve improved Universal Storage to handle more use cases more efficiently, and to make the developer’s job easier when creating both HTAP and OLTP applications.

Universal Storage Introduced in SingleStoreDB Self-Managed 7.0

Beginning with its introduction in SingleStoreDB Self-Managed 7.0 [SS19], Universal Storage has included several capabilities which improve TCO and performance for HTAP and OLTP:

- Hash indexes on columnstores [CS19], to allow fast location of a row in a columnstore, given its key value.
- Sub-segment access, which allows retrieval of a row from a columnstore very quickly – in single-digit milliseconds – once its position is known.
- Row-level locking for columnstores, which allows many concurrent updates of multiple rows in a columnstore to proceed, without requiring transactions to wait.
- SPARSE compression for rowstores [RS19], which can cut RAM usage by half or more for wide rowstore tables that have lots of NULL values (i.e., many of the rowstore tables we see at SingleStore).

The first three changes make it possible to accomplish many tasks in columnstore that were formerly only practical in rowstore.
This allows customers to take advantage of the high degree of compression in columnstore, and the cost advantages of using this disk-based table type, to achieve previously impossible TCO for these workloads.

Universal Storage Advances in SingleStoreDB Self-Managed 7.1

The advent of sparse compression in rowstores preserves the speed advantages of rowstore tables, based as they are in memory, while cutting costs by roughly 50%. Taken together, these changes reduce the need to use mixed rowstore/columnstore implementations in an effort to find price/performance sweet spots (at the expense of added complexity). The desired sweet spot can now often be found either entirely in rowstore, or entirely in columnstore. And now, as part of SingleStoreDB Self-Managed 7.1, we are making further improvements to Universal Storage on the columnstore side. These changes speed up more columnstore operations, enhancing the degree to which you can enjoy the large price advantages of columnstore (5-10x compression, and the use of disk rather than memory for main storage), along with its existing performance advantages (fast scans, for example), plus new performance improvements for specific operations that bring columnstore ever closer to rowstore-like execution times.

The Next Installment of Universal Storage

We’ve worked with many of our customers to understand how they’re using our product for HTAP. Something they’ve consistently asked for is enhancements to the ability to enforce uniqueness constraints automatically. They can already do this with SingleStore rowstores, which have supported unique key validation for years. In addition, in SingleStoreDB Self-Managed 7.0, with hash indexes and subsegment access on columnstores, you could use multiple statements to check for an existing key, then insert a new record with that key if the key was not found. But clearly, it made sense to support standard SQL UNIQUE constraints or keys on a columnstore table. That would make the developer’s job easier.

Unique Hash Keys

So, in SingleStoreDB Self-Managed 7.1, we now support single-column unique keys on columnstores, via an extension of our existing hash indexes. Here’s a simple example of how it works. First, we create a table, `t`, with a unique key on column `a`:

create table t(
  a int,
  b decimal(18,5),
  shard(a),
  unique key(a) using hash,
  key(a) using clustered columnstore);

The clause `unique key(a) using hash` causes SingleStore to validate inserted and updated rows, making sure no duplicates are added to column `a`. You must shard the table on the unique key so the uniqueness test can be done locally on a single leaf node [MSL19]. Now, we insert two rows with different keys:

memsql> insert t values(1, 10.0);
Query OK, 1 row affected (0.11 sec)
memsql> insert t values(2, 20.0);
Query OK, 1 row affected (0.01 sec)

Finally, we try to insert a duplicate key, `2`:

memsql> insert t values(2, 30.0);
ERROR 1062 (23000): Leaf Error (127.0.0.1:3308): Duplicate entry '2' for key 'a_2'

This fails because of the duplicate key.

Performance of Uniqueness Checking

To analyze the performance of hash index uniqueness checking, I inserted 16 million rows into the table `t` used above. Then I ran this statement to insert 1000 new rows:

insert into t
select a+(select max(a) from t), 1000000*rand()
from t
limit 1000;

Running this command with the Profile option in SingleStore Studio shows it took 39 milliseconds, which is a fraction of a millisecond per row.
Profile shows that the following command to insert one row takes less than one millisecond:

insert into t
select a+(select max(a) from t), 1000000*rand()
from t
limit 1;

Both of these INSERT statements show OLTP-level performance for uniqueness checking.

Highly-Selective Joins on Columnstores

SingleStoreDB Self-Managed 7.0 introduced support for columnstore hash indexes, broadening the support for OLTP-type queries on columnstores. However, a common join pattern in OLTP is to have a very selective filter on one table, which produces one or a few rows from the source table, and then join those rows with another table. Databases for OLTP normally use a nested-loop join for this: for each row from the outer table, an index seek is done on the inner table. SingleStoreDB Self-Managed 7.1 supports this kind of highly selective join using an adaptive hash join algorithm, which first does a hash build for the table with the highly-selective filter. Then, if only a few rows are in the hash table, it switches to a nested-loop join that seeks into the larger, probe-side table via the index on that table's join column. If, on the other hand, the hash build side produces a lot of rows, a normal hash join is done. Here’s an example of a simple schema and query that can take advantage of this new strategy for selective joins.

create table orders(
  oid int,
  d datetime,
  key(d) using clustered columnstore,
  shard(oid),
  key(oid) using hash);

create table lineitems(
  id int,
  oid int,
  item int,
  key(oid) using clustered columnstore,
  shard(oid),
  key(oid) using hash);

Now, add some sample data to orders:

insert into orders values(1, now());

-- repeat the statement below until orders has 33.5 million rows
insert into orders
select oid+(select max(oid) from orders), now()
from orders;

Add 67.1 million rows of data to `lineitems`, such that each line item belongs to an order, and each order has exactly two line items:

insert into lineitems select oid, oid, 1 from orders;
insert into lineitems select oid + 1000*1000*1000, oid, 2 from orders;

Find a selective datetime value for d to search on:

select d, count(*)
from orders
group by d;

The result shows that, in my test, a couple of datetime values appear in only one row in `orders`. One of these is `2020-03-30 16:47:05`. The following query uses this date to produce a join result with exactly two rows:

select *
from orders o join lineitems l on o.oid = l.oid
where o.d = "2020-03-30 16:47:05";

It filters on `o.d` to find a single row of orders, then joins to `lineitems` via the hash index on `lineitems.oid`, using the new selective-join algorithm. The profiler in SingleStore Studio shows that this query runs in only one millisecond. That’s OLTP-level speed, with all the TCO advantages of columnstore. The relevant part of the profile plan shape is shown in Figure 1. The query plan works by first seeking into the orders table to get one row, in the ColumnStoreScan operator on the right. The hash build above then builds a hash table with one row in it. Recognizing that the build side is small, the HashJoin operator dynamically switches to a nested-loop join strategy. It seeks the columnstore hash index on `lineitems`, on the left, to find matching rows. Then it completes the join and outputs two rows.
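If you are not using SingleStore Studio, a similar per-operator breakdown can be obtained from a plain SQL client. As far as I know this is what the PROFILE and SHOW PROFILE statements do, but treat the exact syntax as an assumption and check the documentation for your version:

profile select *
from orders o join lineitems l on o.oid = l.oid
where o.d = "2020-03-30 16:47:05";

-- prints operator-level timings for the most recently profiled query in this session
show profile;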
SingleStoreDB Self-Managed 7.1 Now Generally Available
Product

SingleStoreDB Self-Managed 7.1 Now Generally Available

SingleStore is proud to announce the general availability of SingleStoreDB Self-Managed 7.1 for immediate download. SingleStoreDB Self-Managed 7.1 is also available today on SingleStoreDB Cloud, the company’s elastic cloud database, available on public cloud providers around the world. With the availability of SingleStoreDB Self-Managed 7.1, SingleStore further cements itself as the leading NewSQL platform for real-time analytics and augmented transaction processing [Ron19], delivering superb transactional and analytical performance in one system. The release also delivers improved resilience features to further strengthen SingleStore for mission-critical applications providing transaction processing, operational analytics, and more.
What SingleStore Can Do for Time-Series Applications
Data Intensity

What SingleStore Can Do for Time-Series Applications

In earlier blog posts we described what time series data is and key characteristics of a time series database. In this blog post, which originally appeared in The New Stack, Eric Hanson, principal product manager at SingleStore, shows you how to use SingleStore for time series applications. At SingleStore we’ve seen strong interest in using our database for time series data. This is especially the case when an organization needs to accommodate the following: (1) a high rate of event ingestion, (2) low-latency queries, and (3) a high rate of concurrent queries. In what follows, I show how SingleStore can be used as a powerful time-series database and illustrate this with simple queries and user-defined functions (UDFs) that show how to do time-series frequency conversion, smoothing, and more. I also cover how to load time-series data points fast, with no scale limits.

Note: This blog post was originally published in March 2019. It has been updated to reflect the new time series functions in SingleStoreDB Self-Managed 7.0. Please see the Addendum at the end of this article for specifics on using the information herein. – Ed.

Manipulating Time Series with SQL

Unlike most time series-specific databases, SingleStore supports standard SQL, including inner and outer joins, subqueries, common table expressions (CTEs), views, rich scalar functions for date and time manipulation, grouping, aggregation, and window functions. We support all the common SQL data types, including a datetime(6) type with microsecond accuracy that’s perfect as a time series timestamp. A common type of time-series analysis in financial trading systems is to manipulate stock ticks. Here’s a simple example of using standard SQL to do this kind of calculation. We use a table with a time series of ticks for multiple stocks, and produce high, low, open, and close for each stock:

CREATE TABLE tick(ts datetime(6), symbol varchar(5), price numeric(18,4));

INSERT INTO tick VALUES
  ('2019-02-18 10:55:36.179760', 'ABC', 100.00),
  ('2019-02-18 10:57:26.179761', 'ABC', 101.00),
  ('2019-02-18 10:59:16.178763', 'ABC', 102.50),
  ('2019-02-18 11:00:56.179769', 'ABC', 102.00),
  ('2019-02-18 11:01:37.179769', 'ABC', 103.00),
  ('2019-02-18 11:02:46.179769', 'ABC', 103.00),
  ('2019-02-18 11:02:59.179769', 'ABC', 102.60),
  ('2019-02-18 11:02:46.179769', 'XYZ', 103.00),
  ('2019-02-18 11:02:59.179769', 'XYZ', 102.60),
  ('2019-02-18 11:03:59.179769', 'XYZ', 102.50);

This query uses standard SQL window functions to produce high, low, open and close values for each symbol in the table, assuming that “ticks” contains data for the most recent trading day.

WITH ranked AS
(SELECT symbol,
        RANK() OVER w as r,
        MIN(price) OVER w as min_pr,
        MAX(price) OVER w as max_pr,
        FIRST_VALUE(price) OVER w as first,
        LAST_VALUE(price) OVER w as last
 FROM tick
 WINDOW w AS (PARTITION BY symbol
              ORDER BY ts
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))
SELECT symbol, min_pr, max_pr, first, last
FROM ranked
WHERE r = 1;

Results:

+--------+----------+----------+----------+----------+
| symbol | min_pr   | max_pr   | first    | last     |
+--------+----------+----------+----------+----------+
| XYZ    | 102.5000 | 103.0000 | 103.0000 | 102.5000 |
| ABC    | 100.0000 | 103.0000 | 100.0000 | 102.6000 |
+--------+----------+----------+----------+----------+

Similar queries can be used to create “candlestick charts,” a popular report style for financial time series that looks like the image below.
A candlestick chart shows open, high, low, and close prices for a security over successive time intervals.
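The excerpt above also mentions smoothing but does not show it; the original article does that with UDFs, which are not reproduced here. As a rough stand-in, a simple moving average over the tick table defined above can be written with a standard window function:

-- 3-row moving average of price per symbol (a minimal smoothing sketch, not the original UDF approach)
SELECT symbol, ts, price,
       AVG(price) OVER (PARTITION BY symbol
                        ORDER BY ts
                        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS smoothed_price
FROM tick
ORDER BY symbol, ts;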
It’s About Time: Getting More from Your Time-Series Data With SingleStoreDB Self-Managed 7.0
Product

It’s About Time: Getting More from Your Time-Series Data With SingleStoreDB Self-Managed 7.0

SingleStore is uniquely suited to real-time analytics, where data is being ingested, updated, and queried concurrently with aggregate queries. Real-time analytics use cases often are based on event data, where each separate event has a timestamp. It’s natural to interpret such a sequence of events as a time series.

Prior to the 7.0 release, SingleStore delivered many capabilities that make it well-suited to time-series data management [Han19]. These include:

- a scaled-out, shared-nothing architecture that supports transactional and analytical workloads with a standard SQL interface,
- fast query execution via compilation and vectorization, combined with scale out,
- the ability to load data phenomenally fast using the Pipelines feature, which supports distributed, parallel ingestion,
- non-blocking concurrency control, so readers and writers never make each other wait,
- window functions for ranking, moving averages, and so on,
- a highly-compressed columnstore data format suitable for large historical data sets.

Hence, many of our customers are using SingleStore to manage time series data today.

For the SingleStoreDB Self-Managed 7.0 release, we decided to build some special-purpose features to make it even easier to manage time-series data. These include FIRST(), LAST(), TIME_BUCKET(), and the ability to designate a table column as the SERIES TIMESTAMP [Mem19a-d]. Taken together, these allow specification of queries to summarize time series data with far fewer lines of code and fewer complex concepts. This makes expert SQL developers more productive, and opens up the ability to query time series data to less expert developers.

We were motivated to add special time series capability in SingleStoreDB Self-Managed 7.0 for the following reasons:

- Many customers were using SingleStore for time series data already, as described above.
- Customers were asking for additional time series capability.
- Bucketing by time, a common time series operation, was not trivial to do.
- Use of window functions, while powerful for time-based operations, can be complex and verbose.
- We’ve seen brief syntax for time bucketing in event-logging data management platforms like Splunk [Mil14] and Azure Data Explorer (Kusto) [Kus19] be enthusiastically used by developers.

We believe we can provide better overall data management support for customers who manage time series data than the time series-specific database vendors can. We offer time series-specific capability and also outstanding performance, scalability, reliability, SQL support, extensibility, rich data type support, and so much more.

Designating a Time Attribute in Metadata

To enable simple, brief SQL operations on time series data, we recognized that all our new time series functions would have a time argument. Normally, a table has a single, well-known time attribute. Why not make this attribute explicit in metadata, and an implicit argument of time-based functions, so you don’t have to reference it in every query expression related to time? So, in SingleStoreDB Self-Managed 7.0 we introduced a special column designation, SERIES TIMESTAMP, that indicates a default time column of a table. This column is then used as an implicit attribute in time series functions. For example, consider this table definition:

CREATE TABLE tick(
  ts datetime(6) SERIES TIMESTAMP,
  symbol varchar(5),
  price numeric(18,4));

It defines a table, `tick`, containing hypothetical stock trade data. The `ts` column has been designated as the series timestamp.
In examples to follow, we’ll show how you can use it to make queries shorter and easier to write.

The Old Way of Querying Time Series

Before we show the new way to write queries briefly using time series functions and the SERIES TIMESTAMP designation in 7.0, consider an example of how SingleStore could process time series data before 7.0. We’ll use the following data for examples:

INSERT INTO tick VALUES
  ('2020-02-18 10:55:36.179760', 'ABC', 100.00),
  ('2020-02-18 10:57:26.179761', 'ABC', 101.00),
  ('2020-02-18 10:59:16.178763', 'ABC', 102.50),
  ('2020-02-18 11:00:56.179769', 'ABC', 102.00),
  ('2020-02-18 11:01:37.179769', 'ABC', 103.00),
  ('2020-02-18 11:02:46.179769', 'ABC', 103.00),
  ('2020-02-18 11:02:59.179769', 'ABC', 102.60),
  ('2020-02-18 11:02:46.179769', 'XYZ', 103.00),
  ('2020-02-18 11:02:59.179769', 'XYZ', 102.60),
  ('2020-02-18 11:03:59.179769', 'XYZ', 102.50);

The following query works in SingleStoreDB Self-Managed 6.8 and earlier. As output, it produces a separate row, for each stock, for each hour it was traded at least once. (So if a stock is traded ten or more times, in ten separate hours, ten rows are produced for that stock. A row will contain either a single trade, if only one trade occurred in that hour, or a summary of the trades – two or more – that occurred during the hour.) Each row shows the time bucket, stock symbol, and the high, low, open, and close for the bucket period. (If only one trade occurred in that hour, the high, low, open, and close will all be the same – the price the stock traded at in that hour.)

WITH ranked AS
(SELECT symbol,
        RANK() OVER w as r,
        MIN(price) OVER w as min_pr,
        MAX(price) OVER w as max_pr,
        FIRST_VALUE(price) OVER w as first,
        LAST_VALUE(price) OVER w as last,
        from_unixtime(unix_timestamp(ts) div (60*60) * (60*60)) as ts
 FROM tick
 WINDOW w AS (PARTITION BY symbol,
                           from_unixtime(unix_timestamp(ts) div (60*60) * (60*60))
              ORDER BY ts
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))
SELECT ts, symbol, min_pr, max_pr, first, last
FROM ranked
WHERE r = 1
ORDER BY symbol, ts;

This query produces the following output, which can be used to render a candlestick chart [Inv19], a common type of stock chart.

+---------------------+--------+----------+----------+----------+----------+
| ts                  | symbol | min_pr   | max_pr   | first    | last     |
+---------------------+--------+----------+----------+----------+----------+
| 2020-02-18 10:00:00 | ABC    | 100.0000 | 102.5000 | 100.0000 | 102.5000 |
| 2020-02-18 11:00:00 | ABC    | 102.0000 | 103.0000 | 102.0000 | 102.6000 |
| 2020-02-18 11:00:00 | XYZ    | 102.5000 | 103.0000 | 103.0000 | 102.5000 |
+---------------------+--------+----------+----------+----------+----------+

The query text, while understandable, is challenging to write because it uses a common table expression (CTE), window functions with a non-trivial window definition, a subtle use of ranking to pick one row per group, and a non-obvious divide/multiply trick to group time to a 60*60-second bucket.

New Time-Series Functions in SingleStoreDB Self-Managed 7.0

Here I’ll introduce the new time series functions, and then show an example where we write an equivalent query to the “candlestick” query above using the new functions.
I think you’ll be impressed by how concise it is! Also see the latest documentation for analyzing time series data and for the new time series functions.

FIRST()

The FIRST() function is an aggregate function that takes two arguments, as follows:

FIRST (value[, time]);

Given a set of input rows, it returns the value for the smallest associated time. The second argument is optional. If it is not specified, it is implicitly the SERIES TIMESTAMP column of the table being queried. It’s an error if there is no SERIES TIMESTAMP available, or if there is more than one available in the context of the query where FIRST is used; in that case, you should specify the time explicitly. For example, this query gives the symbol of the first stock traded among all stocks in the tick table:

SELECT first(symbol) FROM tick;

The result is ABC, which you can see is the first one traded, at 10:55:36.179760, in the rows inserted above.

LAST()

LAST is just like FIRST except it gives the value associated with the latest time.

TIME_BUCKET()

TIME_BUCKET takes a time value and buckets it to a specified width. You can use very brief descriptions of bucket width, like ‘1d’ for one day, ‘5m’ for five minutes, and so on. The function takes these arguments:

TIME_BUCKET (bucket_width [, time [, origin]])

The only required argument is bucket_width. As with FIRST and LAST, the time argument is inferred to be the SERIES TIMESTAMP if it is not specified. The origin argument is used if you want your buckets to start at a non-standard boundary – say, if you want day buckets that begin at 8am every day.

Putting It All Together

Now that we’ve seen FIRST, LAST, TIME_BUCKET, and SERIES TIMESTAMP, let’s see how to use all of them to write the candlestick chart query from above. A new version of the same query is simply:

SELECT time_bucket('1h') as ts, symbol, min(price) as min_pr,
       max(price) as max_pr, first(price) as first, last(price) as last
FROM tick
group by 2, 1
order by 2, 1;

The new version of the query produces this output, which is essentially the same as the output of the original query.

+----------------------------+--------+----------+----------+----------+----------+
| ts                         | symbol | min_pr   | max_pr   | first    | last     |
+----------------------------+--------+----------+----------+----------+----------+
| 2020-02-18 10:00:00.000000 | ABC    | 100.0000 | 102.5000 | 100.0000 | 102.5000 |
| 2020-02-18 11:00:00.000000 | ABC    | 102.0000 | 103.0000 | 102.0000 | 102.6000 |
| 2020-02-18 11:00:00.000000 | XYZ    | 102.5000 | 103.0000 | 103.0000 | 102.5000 |
+----------------------------+--------+----------+----------+----------+----------+

Look how short this query is! It is 5 lines long vs. 18 lines for the previous version. Moreover, it doesn’t use window functions or CTEs, nor require the divide/multiply trick to bucket time. It just uses standard aggregate functions and scalar functions.

Conclusion

SingleStoreDB Self-Managed 7.0 makes it much simpler to specify many time-series queries using special functions and the SERIES TIMESTAMP column designation. For a realistic example, we reduced lines of code by more than three-fold, and eliminated the need to use some more advanced SQL concepts. Given the high performance, unlimited scalability, and full SQL support of SingleStore, it was a strong platform for time series data in earlier releases. Now, in SingleStoreDB Self-Managed 7.0, we’ve taken that power and added greater simplicity with these new built-in capabilities.
How can you apply SingleStoreDB Self-Managed 7.0 to your time-oriented data?

References

[Han19] Eric Hanson, What SingleStore Can Do For Time Series Applications, https://www.singlestore.com/blog/what-memsql-can-do-for-time-series-applications/, March 2019.
[Inv19] Understanding Basic Candlestick Charts, Investopedia, https://www.investopedia.com/trading/candlestick-charting-what-is-it/, 2019.
[Kus19] Summarize By Scalar Values, Azure Data Explorer Documentation, https://docs.microsoft.com/en-us/azure/kusto/query/tutorial#summarize-by-scalar-values, 2019.
[Mem19a] FIRST, SingleStore Documentation, https://archived.docs.singlestore.com/v7.0/reference/sql-reference/time-series-functions/first/, 2019.
[Mem19b] LAST, SingleStore Documentation, https://archived.docs.singlestore.com/v7.0/reference/sql-reference/time-series-functions/last/, 2019.
[Mem19c] TIME_BUCKET, SingleStore Documentation, https://archived.docs.singlestore.com/v7.0/reference/sql-reference/time-series-functions/time_bucket/, 2019.
[Mem19d] CREATE TABLE, SERIES TIMESTAMP, SingleStore Documentation, https://archived.docs.singlestore.com/v7.0/reference/sql-reference/data-definition-language-ddl/create-table/, 2019.
[Mil14] James Miller, Splunk Bucketing, Mastering Splunk, O’Reilly, https://www.oreilly.com/library/view/mastering-splunk/9781782173830/ch03s02.html, 2014.
The Beauty of a Shared-Nothing SQL DBMS for Skewed Database Sizes
Engineering

The Beauty of a Shared-Nothing SQL DBMS for Skewed Database Sizes

The limitations of a typical, traditional relational database management system (RDBMS) have forced all sorts of compromises on data processing systems: from limitations on database size, to the separation of transaction processing from analytics. One such compromise has been the “sharding” of various customer data sets into separate database instances, partly so each customer could fit on a single computer server – but, in a typical power law, or Zipf, distribution, the larger databases don’t fit. In response, database application developers have had to implement semi-custom sharding schemes. Here, we describe these schemes, discuss their limitations, and show how an alternative, SingleStore, makes them unnecessary. What follows are tales of two different database application architects who face the same problem—high skew of database size for different customer data sets, meaning a few are much larger than others—and address this problem in two different ways. One tries to deal with it via a legacy single-box database and through the use of “clever” application software. The other uses a scalable database that can handle both transactions and analytics—SingleStore. Judge for yourself who’s really the clever one. The Story of the Hero Database Application Architect Once there was a database application architect. His company managed a separate database for each customer. They had thousands of customers. They came up with what seemed like a great idea. Each customer’s data would be placed in its own database. Then they would allocate one or more databases to a single-node database server. Each server would handle operational queries and the occasional big analytical query. When a server filled up, they’d just allocate additional databases to a different server. Everything was going great during development. The application code only had to be written once, for one scenario — all data for a customer fitting one DBMS server. If a database was big, no problem, just provision a larger server and put that database alone on that server. Easy. Then they went into production. Everything was good. Success brought in bigger customers with more data. Data grew over time. The big customer data grew and grew. The biggest one would barely fit on a server. The architect started losing sleep. He kept the Xanax his doctor prescribed for anxiety in the top drawer and found himself dipping into it too often. Then it happened. The biggest customer’s data would not fit on one machine anymore. A production outage happened. The architect proposed trimming the data to have less history, but the customers screamed. They needed 13 months minimum or else. He bought time by trimming to exactly 13 months. They only had two months of runway before they hit the wall again. He got his top six developers together for an emergency meeting. They’d solve this problem by sharding the data for the biggest customer across several DBMS servers. Most queries in the app could be directed to one of the servers. The app developers would figure out where to connect and send the query. Not too hard. They could do it. But some of the queries had aggregations over all the data. They could deal with this. They’d just send the query to every server, bring it back to the app, and combine the data in the app layer. His best developers actually thought this was super cool. It was way more fun than writing application software. They started to feel really proud of what they’d built. Then they started having performance problems. 
Moving data from one machine to the other was hard. There were several ways they could do things. Which way should they do it? Then someone had the great idea to write an optimizer that would figure out how to run the query. This was so fun. Around this time, the VP from the business side called the architect. She said the pace of application changes had slowed way down. What was going on? He proudly but at the same time sheepishly said that his top six app developers had now made the leap to be database systems software developers. Somehow, she did not care. She left his office, but it was clear she was not ready to let this lie. He checked the bug count. Could it be this high? What were his people doing? He’d have to fix some of the bugs himself. He started to sweat. A nervous lump formed in the pit of his stomach. The clock struck 7. His wife called and said dinner was ready. The kids wanted to see him. He said he’d leave by 8. The Story of the Disciplined Database Application Architect Once there was a database application architect. His company managed a separate database for each customer. They had thousands of customers. They at first considered what seemed like a great idea. Each customer’s data would be placed in its own database on a single-node database server. But, asked the architect, what happens when there’s more data than will fit on one machine? One of the devs on his team said he’d heard of this scale-out database called SingleStore that runs standard SQL and can do both operational and analytical workloads on the same system. If you run out of capacity, you can add more nodes and spread the data across them. The system handles it all automatically. The dev had actually tried the free version of SingleStore for a temporary data mart and it worked great. It was really fast. And running it took half the work of running their old single-box DBMS. All their tools could connect to it too. They decided to run just a couple of SingleStore clusters and put each customer’s data in one database on one cluster. They got into production. Things were going great; business was booming. Their biggest customer got really big really fast. It started to crowd out work for other customers on the same cluster. They could see a problem coming. How could they head it off? They had planned for this. They just added a few nodes to the cluster and rebalanced the biggest database. It was all done online. It took an hour, running in the background. No downtime. The VP from the business side walked in. She had a new business use case that would make millions if they could pull it off before the holidays. He called a meeting the next day with the business team and a few of his top developers. They rolled up their sleeves and sketched out the application requirements. Yeah, they could do this. Annual review time came around. His boss showed him his numbers. Wow, that is a good bonus. He felt like he hadn’t worked too hard this year, but he kept it to himself. His golf score was down, and his pants still fit just like in college. He left the office at 5:30. His kids welcomed him at the door. The Issue of Skewed Database Sizes The architects in our stories are facing a common issue. They are building services for many clients, where each client’s data is to be kept in a separate database for simplicity, performance and security reasons. The database sizes needed by different customers vary dramatically, following what’s known as a Zipf distribution [Ada02]. 
In this distribution, the largest databases have orders of magnitude more data than the average ones, and there is a long tail of average and smaller-sized databases. In a Zipf distribution of database sizes, the size of a database follows a pattern like size(r) = C * r^(-b), with b close to one, where r is the rank and C is a constant; the largest database has rank one, the second-largest rank two, and so on. The following figure shows a hypothetical, yet realistic Zipf distribution of database size for b = 1.3 and C = 10 terabytes (TB). Because of the strong variation among database sizes, the distribution is considered highly skewed.
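To make the formula concrete, here is a small sketch in plain SQL (with made-up ranks, using the parameters above, C = 10 TB and b = 1.3) that computes hypothetical sizes for the five largest databases:

SELECT db_rank, ROUND(10 * POW(db_rank, -1.3), 2) AS size_tb
FROM (SELECT 1 AS db_rank UNION ALL SELECT 2 UNION ALL SELECT 3
      UNION ALL SELECT 4 UNION ALL SELECT 5) ranks
ORDER BY db_rank;

This yields roughly 10, 4.1, 2.4, 1.7, and 1.2 TB for ranks one through five: the largest database dominates, and sizes fall off quickly into the long tail.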
SingleStore’s Patented Universal Storage - Part 1
Product

SingleStore’s Patented Universal Storage - Part 1

SingleStore Universal Storage is a new vision for how databases can work – first blurring and then, for most use cases, erasing any apparent difference between today’s rowstore and columnstore tables. In SingleStore Universal Storage™ Phase 1, shipping as part of SingleStoreDB Self-Managed 7.0 (currently in beta), rowstore tables get null compression, lowering TCO in many cases by 50%. Columnstore tables get seekable columnstores, which support fast seeks and updates, giving columnstore tables many of the performance and usability features of rowstore tables. The hard choices developers have faced up to now between rowstore and columnstore tables – or between separate rowstore and columnstore database software offerings – are significantly reduced, cutting costs and improving performance. With the new system of record improvements, also offered in SingleStoreDB Self-Managed 7.0, our vision of “one database to rule them all” begins to be realized. This is the first in a series of four articles that describe SingleStore's unique, patented Universal Storage feature. Read Part 2, Part 3 and Part 4 in the series to learn the whole story. Introducing SingleStore Universal Storage SingleStore Universal Storage is a breakthrough in database storage architecture to allow operational and analytical workloads to be processed using a single table type. This will simplify the developer’s job while providing tremendous scalability and performance, and minimizing costs. As a significant first step, in SingleStoreDB Self-Managed 7.0, Universal Storage Phase 1 allows OLTP applications to use columnstore tables to run operational transactions on data much bigger than RAM. This is supported via new hash indexes and related speed and concurrency improvements for disk-based columnstore tables, delivering seeks and updates at in-memory speeds. Universal Storage Phase 1 also now supports transactional applications on larger data sets more economically, via in-memory compression for null values in fast, memory-based rowstore tables, with memory savings of roughly 50% for many use cases. Together, these improvements give SingleStore customers better performance at lower cost, the flexibility to get the most from computing resources, and the ability to tackle the largest data management problems economically. Our Vision for the Ultimate Table Format SingleStore supports two types of data tables in the same database: in-memory rowstores, which are ideal for online transaction processing (OLTP) and hybrid transactional/analytical (HTAP) applications, and disk-based columnstores, which are the best choice for purely analytical applications. Customers love the speed and predictability of rowstores for OLTP and HTAP. They also love the truly incredible analytical performance of columnstores, plus their ability to store far more data than will fit in RAM economically. But customers have been asking for us to improve the total cost of ownership (TCO) of rowstores, because they have to provision servers with large amounts of RAM when tables get big, which can be costly. They’ve also asked for us to add OLTP-like features to columnstores, such as fast UPSERTS and unique constraints. In response, we’ve developed a vision of the future in which one table type can be used for OLTP, HTAP, and analytics on arbitrarily large data sets, much bigger than the available RAM, all with optimal performance and TCO. We call this SingleStore Universal Storage. 
Our ultimate goal for Universal Storage is that performance for OLTP and HTAP is the same as for a rowstore table, if the amount of RAM that would have been required for an explicitly defined rowstore table is available, and that OLTP performance degrades gracefully if less RAM than that is available. For analytics, utilizing large scans, joins, aggregates, etc., the goal is to provide performance similar to that of a columnstore table. The old-fashioned way to support OLTP on tables bigger than RAM would be to use a legacy storage structure like a B-tree. But that could lead to big performance losses; we need to do better. In the Universal Storage vision, we preserve the performance and predictability of our current storage structures and compiled, vectorized query execution capability, and even improve on it. All while reducing the complexity and cost of database design, development, and operations by putting more capability into the database software. In the 7.0 release, we are driving toward solving the customer requirements outlined above, and ultimately realizing our vision for Universal Storage, in two different ways. One is to allow sparse rowstores to store much more data in the same amount of RAM, thus improving TCO while maintaining great performance, with low variance, for seeks. We do this through sparse in-memory compression. The other is to support seekable columnstores that allow highly concurrent read/write access. Yes, you read that right, seekable columnstores. It’s not an oxymoron. We achieve this using hash indexes and a new row-level locking scheme for columnstores, plus subsegment access, a method for reading small parts of columnstore columns independently and efficiently. In what follows, we’ll explain how we take a critical first step toward the Universal Storage vision in SingleStoreDB Self-Managed 7.0. And this is just the beginning. Sparse Rowstore Compression To address the TCO concerns of our customers with wide tables that have a high proportion of NULL values – a situation often found in the financial sector – we’ve developed a method of compressing in-memory rowstore data. This relies on a bitmap to indicate which fields are NULL. But we’ve put our own twist on this well-established method of storing less data for NULL values by maintaining a portion of the record as a structure of fixed-width fields. This allows our compiled query execution and index seeking to continue to perform at the highest levels. We essentially split the record into two parts: a fixed-width portion containing non-sparse fields and index keys, and a variable-width portion containing the fields which have been designated as sparse. Our NULL bitmap makes use of four bits per field instead of one, to enable room for future growth. We’ve created a pathway where we can add the ability to compress out default values like blanks and zeros, and also store normally fixed fields as variable-width fields – e.g., storing small integers in a few bytes, rather than 8 bytes, if they are declared as bigints. The following figure illustrates how sparse compression works for a table with four columns, the last three of which are designated as SPARSE. (Assume that the first one is a unique NOT NULL column, so it is not designated as SPARSE.) The fact that the first column is not sparse and the last three columns may be sparse is recorded in table-level metadata.
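As a rough illustration of the column-level designation described above, here is a sketch of a four-column rowstore table (rowstore being the default table type before 7.3) with the last three columns marked SPARSE. The SPARSE column modifier, and the table and column names, are assumptions made for illustration; check the documentation for the exact syntax in your version:

CREATE TABLE positions (
  account_id BIGINT NOT NULL,        -- unique, NOT NULL, not sparse
  exposure DECIMAL(18,5) SPARSE,     -- mostly-NULL wide columns designated SPARSE
  hedge DECIMAL(18,5) SPARSE,
  margin DECIMAL(18,5) SPARSE,
  PRIMARY KEY (account_id)
);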
Query Processing Improvements in SingleStoreDB Self-Managed 6.8
Data Intensity

Query Processing Improvements in SingleStoreDB Self-Managed 6.8

In this blog post, I’ll focus on new query processing capabilities in SingleStoreDB Self-Managed 6.8. The marquee query feature is just-in-time (JIT) compilation, which speeds up query runtimes on the first run of a query – now turned on by default. We have also improved performance of certain right and left outer joins and related operations, and Rollup and Cube. In addition, we add convenience features, including sub-select without an alias, and extended Oracle compatibility for date and time handling functions. Finally, new array functions for splitting strings and converting JSON data are added. Other improvements in 6.8 are covered elsewhere. These include:

- secured HDFS pipelines
- improved pipelines performance
- LOAD DATA null column handling extensions
- information schema and management views enhancements

Now, let’s examine how just-in-time compilation of queries can work in a database.

Speeding up First Query Runtimes

SingleStore compiles queries to machine code, which allows us to get amazing performance, particularly when querying our in-memory rowstore tables. By spending a bit more time compiling than most databases – which interpret all queries, not compiling them – we get high performance during execution. This works great for repetitive query workloads, such as real-time dashboards with a fixed set of queries and transactional applications. But our customers have been asking for better performance the first time a query is run, which is especially applicable for ad hoc queries – when slower performance can be especially noticeable. In SingleStoreDB Self-Managed 6.7, we first documented a JIT feature for SQL queries, enabled by running ‘set interpretermode = interpret_first’. Under this setting, SingleStore starts out interpreting a query, compiles its operators in the background, then dynamically switches from interpretation to execution of compiled code as the query runs the first time. The interpret_first setting was classified as experimental in 6.7, and was off by default. In 6.8, we’re pleased to say that interpret_first is now fully supported and is on by default. This setting can greatly improve the user’s experience running ad hoc queries, or when using any application that causes a lot of new SQL statements to be run, as when a user explores data through a graphical interface. The interpret_first setting can speed up the first run of a large and complex query – say, a query with more than seven joins – several times, by reducing compile overhead, with no loss of performance on longer-running queries for their first run.

Rollup and Cube Performance Improvements

Cube and Rollup operator performance has been improved in SingleStoreDB Self-Managed 6.8 by pushing more work to the leaf nodes. In prior releases, Cube and Rollup were done on the aggregator, requiring more data to be gathered from the leaves to the aggregator, which can take more time. For example, consider the following query from the Cube and Rollup documentation:

SELECT state, product_id, SUM(quantity)
FROM sales
GROUP BY CUBE(state, product_id)
ORDER BY state, product_id;

The graphical query plan for this in 6.8, obtained using SingleStore Studio, is the following:
How to Use SingleStore with Intel’s Optane Persistent Memory
Product

How to Use SingleStore with Intel’s Optane Persistent Memory

Intel’s new Optane DC persistent memory adds a new performance option for SingleStore users. After careful analysis, we’ve identified one area in which SingleStore customers and others can solve a potentially urgent problem using Optane today, and we describe that opportunity in this blog post. We also point out other areas where SingleStore customers and others should keep an eye on Optane-related developments for the future. If you need broader information than what’s offered here, there are many sources for more comprehensive information about Optane and what it can do for you, beginning with Intel itself.
DZone Webinar – SingleStore for Time Series, Real Time, and Beyond
Data Intensity

DZone Webinar – SingleStore for Time Series, Real Time, and Beyond

Eric Hanson, Principal Product Manager at SingleStore, is an accomplished data professional with decades of relevant experience. This is an edited transcript of a webinar on time series data that he recently delivered for developer website DZone. Eric provides an architect’s view on how the legacy database limits of the past can be solved with scalable SQL, and shows how challenging workloads like time series and big data analytics are addressed by SingleStore, without sacrificing the familiarity of ANSI SQL. You can view the webinar on DZone. Time series data is getting more and more interest as companies seek to get more value out of the data they have – and the data they can get in the future. SingleStore is the world’s fastest database – typically 10 times faster, and three times more cost effective, than competing databases. SingleStore is a fully scalable relational database that supports structured and semi-structured data, with schema and ANSI SQL compatibility. SingleStore has features that support time series use cases. For time series data, key strengths of SingleStore include a very high rate of data ingest, with processing on ingest as needed; very fast queries; and high concurrency on queries. Key industries with intensive time series requirements that are using SingleStore today include energy and utilities; financial services; media and telecommunications; and high technology. These are not all the industries that are using SingleStore, but in these four in particular we have a lot of customers, and these industries rely heavily on time series data. Editor’s Note: Also see our blog posts on time series data, choosing a time series database, and implementing time series functions with SingleStore. We also have an additional recorded webinar (from us here at SingleStore) and an O’Reilly ebook download covering these topics. Introduction to SingleStore SingleStore has a very wide range of attributes that make it a strong candidate for time series workloads. You can see from the chart that SingleStore connects to a very wide range of other data technologies; supports applications, business intelligence (BI) tools, and ad hoc queries; and runs everywhere – on bare metal or in the cloud, in virtual machines or containers, or as a service. No matter where you run it, SingleStore is highly scalable. It has a scale-out, shared-nothing architecture.
SingleStoreDB Self-Managed 6.7 Performance Improvements
Engineering

SingleStoreDB Self-Managed 6.7 Performance Improvements

Performance is in SingleStore’s DNA. Last March, we shattered the trillion-rows-per-second scan barrier for processing a single SQL query on industry-standard Intel servers. That query processed data from a single table, and of course, lots of analytical queries use multiple tables, particularly star schemas. So we’ve broadened our vectorized, single instruction, multiple data (SIMD) query execution technology beyond single-table, group-by aggregate queries to star join group-by aggregate queries. A star join is a common kind of query that operates on what’s known as a star schema, used in most data warehouses and data marts, and some operational analytics applications as well. In a star schema, a single large table called a fact table is linked to several smaller tables called dimension tables. Star join queries typically join these tables together, aggregate numeric “measure” columns from the fact table, and group by descriptive fields from the dimension tables. Our star join performance was already excellent. This improvement makes it ludicrously good. In addition to star join performance improvements, we’ve made many other improvements to query and load speed, as I’ll describe in this blog post. When combined with SingleStore’s existing high performance, what this improvement means for application developers building real-time analytics systems, data warehouses, and data marts is that they can use SingleStore to get stunning performance at concurrency levels they couldn’t have dreamed of before. And they can do it with a database that supports mixed workloads and runs ANSI SQL.

Star Join Performance Improvements

We’ve added proprietary, patent-pending new algorithms for star join that make use of vectorization and SIMD. These algorithms operate directly on our compressed columnstore data formats, which we call encoded data. Instead of doing a hash join in the traditional way, where each row of the “probe-side” table is used to do a function call to search into a hash table created from the “build-side” table, we now have a special implementation of hash join that doesn’t do function calls in the inner loop. Instead, it uses generated, templatized code to process part of the work for multiple probes at once in a single SIMD instruction, operating directly on encoded data. To demonstrate this, I created a basic star schema data set. Since we have quite a few customers in the media market, I made a media-oriented example with one fact table and three dimensions, like so:

Table             | Number of rows | Description
`media_view_fact` | 1,310,720,000  | Fact table describing a view event for a media object (e.g., a web page)
`date_dim`        | 2,556          | One row with descriptive properties for each day, for 7 years
`user_dim`        | 1,000          | One row for each system user
`item_dim`        | 10,000         | One row for each media item being tracked

I created a basic, but realistic, star join query that forces processing of every single row from the fact table, does a join, and groups by columns from a dimension, as follows:

select d_daynuminweek, d_dayofweek, count(*) as c
from media_view_fact f, date_dim d
where f_datekey = d_datekey
group by 1, 2
order by 1 asc;

The hardware I used was a single Intel Skylake server with two 2.6 GHz processors and 56 total cores. Data was partitioned evenly with one partition per core. I configured the system with one SingleStore aggregator node and two SingleStore leaf nodes, one leaf for each processor. The system had non-uniform memory access (NUMA) enabled and each leaf was on a separate NUMA node.
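The original post doesn't include the table definitions in this excerpt. For readers who want to set up something similar, here is a hedged sketch of DDL consistent with the column names used in the queries above and below; the data types and keys are guesses, not the schema actually used (user_dim and item_dim would follow the same pattern, with u_userkey/u_zipcode and i_itemkey/i_itemuri columns):

create table date_dim(
  d_datekey int,
  d_year int,
  d_month varchar(12),
  d_daynuminweek int,
  d_dayofweek varchar(12),
  key(d_datekey) using clustered columnstore);

create table media_view_fact(
  f_datekey int,
  f_userkey int,
  f_itemkey int,
  key(f_datekey) using clustered columnstore);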
I ran this query twice – once with, and once without, the new encoded join capability enabled. The results are summarized below:

Encoded joins enabled?   Average runtime (30 runs)
no                       3.01 s
yes                      0.0297 s
speedup (times)          101

No, this is not a typo. This is a 101-times speedup! The data is not pre-aggregated. The speedup is due to operating directly on encoded data, using SIMD and the enhanced join algorithm.

What does this mean? All existing applications can now get far faster response times and concurrent throughput. Even more exciting, you can create new applications that allow interactive analytics on large data sets with rapidly changing data, without resorting to the complexity of pre-aggregating data.

More complex queries also show substantial speedups. For example, here's a four-way star join with filters:

select d_dayofweek, count(*)
from media_view_fact, date_dim, user_dim, item_dim
where f_datekey = d_datekey
  and f_userkey = u_userkey
  and f_itemkey = i_itemkey
  and i_itemuri regexp 'http://www.c000[0-5].*'
  and u_zipcode like '002%'
  and d_year = 1997
  and d_month = "January"
group by d_dayofweek;

The results are as follows:

Encoded joins enabled?   Runtime
no                       0.09 s
yes                      0.03 s
speedup (times)          3

The fact table is sorted (keyed) by f_datekey, and we've supported range (segment) elimination via join since our 6.0 release. So the date range filter implied by this join requires us to read only one year of data, not all seven years. And the filter effects of the join are employed during the scan of the fact table both with and without encoded joins enabled, via Bloom filters or a related approach we use in SingleStoreDB Self-Managed 6.7. So the speedup is not as dramatic as the 101X speedup for the previous query. But a 3X speedup, on top of SingleStore's generally high performance, is amazing nevertheless!

Star Schema Benchmark Release-to-Release Gains

The well-known star schema benchmark (SSB) was already performing very well in SingleStoreDB Self-Managed 6.5, mainly because the joins in the SSB queries are highly selective. That means most rows in the fact table are filtered out by joins with the dimension tables, so the vectorized testing of Bloom filters available in 6.5 worked well to speed up these queries. Still, the new star join query execution technology in 6.7 has further improved this workload.

The hardware for this test used four AWS m4.2xlarge leaf nodes (8 logical cores, 32GB RAM each). The primary fact table has 600 million rows. There are no precomputed aggregates; query results are computed from scratch. This chart illustrates the gains:

[Chart: SSB query runtimes, SingleStoreDB Self-Managed 6.5 vs. 6.7]
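For context, SSB queries pair selective dimension filters with an aggregate over the fact table. Q1.1, as published in the benchmark specification, looks like the following; it is shown only for reference and is not taken from the post, and the table and column names follow the standard SSB schema (some implementations rename the `date` table, e.g., to `dates` or `dwdate`).

-- SSB Q1.1, from the published benchmark; shown for context only.
select sum(lo_extendedprice * lo_discount) as revenue
from lineorder, date
where lo_orderdate = d_datekey
  and d_year = 1993
  and lo_discount between 1 and 3
  and lo_quantity < 25;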
Read Post
Performance Improvements in SingleStoreDB Self-Managed 6.5
Product

Performance Improvements in SingleStoreDB Self-Managed 6.5

With SingleStoreDB Self-Managed 6.5, the fastest database just got a lot faster. SingleStore is a performance-oriented product that capitalizes on several techniques to give excellent overall speed. These techniques include optimized disk-based and in-memory data structures, parallelism, scale out, compilation of queries to machine code, vectorization, and single instruction, multiple data (SIMD).

In SingleStoreDB Self-Managed 6.5, we advanced performance in many dimensions, including query execution (QE), query optimization (QO), data loading, and system management. SingleStoreDB Self-Managed 6.0 was already fast. Very fast. But we were not satisfied.

QE performance is always a high priority for us, since it's what users see most when exploring data. We improved QE performance for the following operations:

- Shuffle (data redistribution)
- High-cardinality GROUP BY
- IN-list filtering
- Filtering of data stored with integer run-length encoding (RLE)
- Internal memory allocation

For many users, the improved shuffle and high-cardinality GROUP BY performance will be the most noticeable advancements. In an internal test workload that we run based on the TPC-H benchmark, the average query speed improved by 2.2X over a set of 22 queries. The IN-list filter, integer RLE filtering, and memory allocation improvements can each give 10X or more query speedups in the right situation.

The query optimizer and statistics systems have improved significantly. The blazing-fast SIMD QE we used to demonstrate a trillion-row-per-second query with our 6.0 release had gotten so fast that our query optimizer didn't always know when to use it. So we re-calibrated the optimizer to use it when it's best suited. We've also improved our statistics (histograms) to have much higher resolution, which will lead to better query plans.

Finally, loading and system management have improved. Loading from Amazon S3 cloud storage using our Pipelines feature can start in seconds in SingleStoreDB Self-Managed 6.5, versus minutes in SingleStoreDB Self-Managed 6.0 for large S3 buckets. Load speed for wide comma-separated value (CSV) files with quoted strings is over twice as fast. Columnstore loading and all other operations that require large sorts are far faster. And restart recovery for clusters with hundreds of databases is more than ten times as fast.

Query Performance

Here, we'll examine the query performance improvements in SingleStoreDB Self-Managed 6.5 in more depth.

Fast Shuffle

SingleStore is a distributed, shared-nothing database. Applications connect to aggregator nodes. Aggregators compile queries, start execution, and assemble results. The data is stored across multiple leaf nodes using hash partitioning (sharding). Executing queries requires data to be moved between leaf nodes, as well as from leaves to aggregators.

Data movement takes time. And we don't like it when queries take too much time. Which led us to create Project Sanic. When Ya Gotta Go Fast! (Google images of "Sanic" and you'll see our project mascot.) Sanic is all about speeding up data movement in SingleStore.

Before SingleStoreDB Self-Managed 6.5, shuffle relied heavily on converting data to text form, then back to an internal binary form for processing at the receiving node. 6.5 shuffles data primarily in binary format. For certain types of data, especially numbers and dates, this saves a lot of CPU time on both the sending and receiving ends, because we don't have to convert to string form and back again.

Consider a simple example.
Suppose we create this table:

create table t(a int, b int, shard(a), key(a));

Then we insert about 164 million rows into it, with unique integer values in both columns `a` and `b` for every row. This table is sharded (hash partitioned) by column `a` across leaf nodes, and each leaf node has an index on column `a`. Now, suppose we want to run this query:

select count(distinct b) from t;

A highly parallelizable version of this query repartitions data by column `b` using hashing, then performs a local distincting operation on each leaf. Finally, these results are concatenated together at the aggregator, and the result is returned to the client.

The query above runs on a 4-leaf system with 16 cores and 16 partitions total in the following times:

SingleStoreDB Self-Managed 6.0   SingleStoreDB Self-Managed 6.5   Speedup (times)
26.57 sec                        8.74 sec                         3.04

This healthy 3.04-times speedup is due entirely to shuffle performance improvements.

High-Cardinality GROUP BY

High-cardinality GROUP BY queries are ones that have many distinct grouping keys (say, tens of thousands or more). In 6.0, these queries computed a local grouping for each partition, and each of these was forwarded to the aggregator, which did the final grouping. If there were many partitions per node, this could involve a lot of data transmission, causing a network bottleneck and leaving the aggregator with a lot of work to do to form the final aggregate set.

In 6.5, performance of queries like this is improved by (a) doing a local aggregate at the leaf level across all the partitions on that leaf, (b) spilling the rows to the network if it turns out almost every GROUP BY key is distinct, so the local aggregation is not helping, and (c) reducing memory usage by having each thread handle a subset of the keys.

The realistic best case for this improvement is a GROUP BY that shuffles (rather than matching the shard key), where each partition has a large set of keys but the sets of keys across partitions are roughly the same. Shard-key-matching GROUP BY is not affected by this change. The key part of this improvement is the local aggregate at the leaf level, since it allows more of the work to be done by the leaves and less by the aggregators, and it reduces network traffic.

Performance improvement: this change can give several-times speedups for some queries. For example, query q22 in one of our internal test workloads derived from TPC-H sped up by more than a factor of 2 based solely on this change.

IN-list Filter Improvement

Performance has been improved for queries that use IN-list filters. 6.0 required a full scan of the IN-list for each row. 6.5 uses a Bloom filter to check whether there is any match, which helps increase scan speed. For IN-list filters that disqualify the large majority of rows, this can give order-of-magnitude speedups. The improvement works for rowstore and columnstore tables, and it is always enabled.

Here's an example that shows the dramatic speedup possible for IN-list queries in 6.5. For a table t(i int) with 8 million rows of unique integers, in rowstore format, we ran the following queries using 6.0 and 6.5:

Query                                                6.5                       6.0
select count(*) from t where i in (1, 100, 200);     0.23 sec                  same
select count(*) from t where i in (1, 2, ... 904);   0.20 sec (60x speedup!)   12.0 sec

The second query, with an IN-list containing 904 elements, speeds up tremendously, while the first one, with a short IN-list, stays the same.
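The post doesn't include the setup script for this comparison. A minimal sketch of what it might look like is below; the table shape comes from the description above, but the row-generation approach is an assumption for illustration only.

-- Hypothetical setup for the IN-list comparison: a rowstore table of unique integers.
create table t(i int, key(i));
insert into t values (1), (2), (3), (4);   -- small seed
insert into t select i + 4 from t;         -- 8 rows
insert into t select i + 8 from t;         -- 16 rows
-- ...keep doubling (i + 16, i + 32, ...) until the table holds roughly 8 million rows.

-- The two filters compared in the table above:
select count(*) from t where i in (1, 100, 200);
select count(*) from t where i in (1, 2, 3 /* ...continue up to 904 elements */);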
Integer RLE Filter Pushdown

Run-length encoding (RLE) is one of several compression strategies used for our columnstore tables. SingleStoreDB Self-Managed 6.5 can now push integer filters down to RLE-encoded data in the columnstore scan layer, giving significant speedup.

Here's a performance example. The table t(i int) is a columnstore with 10 million rows and 1,000 unique values, keyed on i, so it will be encoded with RLE.

select count(*) from t where i = 2;

The performance results on two cores with two partitions are as follows:

6.0        6.5
0.11 sec   0.002 sec

As you can see, this enhancement can provide order-of-magnitude improvements. It's always on; there is no need to set any switches to control it.

Internal Memory Allocation Speedup

Query execution memory allocation performance was significantly improved. Some TPC-DS benchmark queries speed up by over 2X based on this change. This microbenchmark sped up by a factor of 20 compared to SingleStoreDB Self-Managed 6.0:

select count(substr(str, 0, 9)) from t;

The improvement is in the internal memory allocator used by queries that contain non-trivial string expressions or create temporary rows, so queries such as this can become far faster.

Query Optimizer and Statistics

QE improvements are fantastic, but the QO system is always critical to make sure the right QE operators are used in the best possible way. Our query optimizer is the guidance system of the DBMS.

Automatic Selection of HashGroupBy to Get Operations on Encoded Data

As we publicized in our Trillion Rows Per Second demo, in SingleStoreDB Self-Managed 6.0 we improved our query execution dramatically for single-table GROUP BY/aggregate queries by using a new implementation of HashGroupBy that uses SIMD and operations on encoded data. In 6.5, we've improved the costing code in the query optimizer, calibrating it so it understands more accurately the relative speed of HashGroupBy and StreamingGroupBy operations. So queries like this now routinely get a HashGroupBy instead of a StreamingGroupBy:

select stock_symbol, count(*) as c
from trade
group by stock_symbol
order by c desc
limit 10;

The table `trade` is a columnstore, sharded on `id` or `stock_symbol`, with a key on `stock_symbol`. In effect, our query optimizer has caught up with the new capabilities available in our query execution system. The result is that queries may now run up to 80 times faster, automatically, with no need for hints or other workarounds.
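The post describes the `trade` table but doesn't show its DDL. A minimal sketch consistent with that description might look like the following; only `id` and `stock_symbol` are named in the post, so the remaining columns and types are assumptions added to make the sketch self-contained.

-- Hypothetical DDL for the `trade` table described above; extra columns are assumed.
create table trade (
  id bigint,
  stock_symbol varchar(10),
  trade_time datetime,
  shares int,
  share_price decimal(18, 4),
  shard(stock_symbol),
  key (stock_symbol) using clustered columnstore
);

With a clustered columnstore key on `stock_symbol`, the grouping column typically ends up dictionary- or run-length-encoded, which is the situation where the HashGroupBy implementation can operate directly on encoded data.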
Advanced Histograms for Improved Filter Cardinality Estimation

SingleStoreDB Self-Managed 6.5 introduces a new type of histogram to represent what we sometimes refer to as "range" statistics. These histograms are termed "advanced histograms," and the existing histograms will be called "legacy histograms." The advanced histograms hold substantially more information than the legacy histograms for certain data distributions. So non-uniform data distributions with skew typically get much better estimates for less frequent values, and better average estimates, than in SingleStoreDB Self-Managed 6.0.

For example, for this data distribution:

+------+----------+
| v    | count(*) |
+------+----------+
|    1 |   320000 |
|    2 |      320 |
|    3 |      352 |
|    4 |      256 |
|    5 |        1 |
|    6 |      512 |
|    7 |      512 |
|    8 |      544 |
|    9 |      512 |
|   10 |      608 |
|   11 |        3 |
|   12 |      608 |
|   13 |      576 |
|   14 |      288 |
|   15 |      288 |
|   16 |      288 |
|   17 |      160 |
|   18 |      352 |
|   19 |        2 |
|   20 |      480 |
+------+----------+

And this query:

select count(*) from t where v = <constant>;

We get these estimates:

Value of <constant>   Actual    6.0 estimate   6.5 estimate
1                     320,000   320,047        320,174
3                     352       450            366
5                     1         450            1

In 6.0, the high-frequency values (such as 1) crowded out information about the lower-frequency values (such as 5) in the histogram, causing loss of resolution. In 6.5, this crowding-out effect happens much less quickly: normally, up to 50 data values can be represented nearly exactly, regardless of data frequency skew. In a larger query, estimation errors like this can compound and cause a sub-optimal strategy to be chosen for a join (e.g., hash instead of nested loop) or a distributed operation (e.g., shuffle instead of broadcast). Since these errors are reduced in 6.5, overall query performance will improve.

Before-and-After Results, 6.0 to 6.5, TPC-H

Query performance in one of our internal test workloads derived from the TPC-H benchmark improved dramatically, about 2.2X on average, from 6.0 to 6.5. The primary reason is improved shuffle performance and, to a lesser extent, the high-cardinality GROUP BY improvement. Also, query q19 improved due to a rewrite that extracts conditions common to the branches of an OR, ((a and b) or (a and c)) → (a and (b or c)); a sketch of this rewrite appears after the results table below.

SingleStore TPC-H, Scale Factor 1000, May 2018. Hardware: 3 Amazon Web Services (AWS) r3.8xlarge leaves, 32 vcores each.

Query      6.0 Time (sec)   6.5 Time (sec)   Speedup (%)   Speedup (times)   Notes
q1         5.824            5.38             7.62%         1.1
q2         3.537            2.977            15.83%        1.2
q3         45.661           18.052           60.47%        2.5
q4         6.437            6.47             -0.51%        1.0
q5         22.127           12.669           42.74%        1.7
q6         3.334            3.392            -1.74%        1.0
q7         17.478           9.641            44.84%        1.8
q8         4.97             4.023            19.05%        1.2
q9         78.55            33.866           56.89%        2.3
q10        15.395           11.287           26.68%        1.4
q11        3.654            3.495            4.35%         1.0
q12        3.821            4.252            -11.28%       0.9
q13        62.558           39.001           37.66%        1.6
q14        75.244           21.528           71.39%        3.5
q15        24.668           9.931            59.74%        2.5
q16        40.461           12.24            69.75%        3.3
q17        7.21             3.21             55.48%        2.2
q18        300†             72.387           75.87%        4.1               query timed out with error on 6.0
q19        6.691            4.998            25.30%        1.3
q20        70.276           26               63.00%        2.7
q21        149.832          143.061          4.52%         1.0
q22        46.297           9.69             79.07%        4.8
Averages   33.0 (geomean)   20.8 (geomean)   37% (avg)     2.0 (avg)

†q18 did not run in under 300 sec on 6.0, so its time is given as the timeout of 300 sec.
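To make the q19-style rewrite concrete, here is a hedged sketch of the transformation on a simplified predicate; the shape is inspired by TPC-H q19, but it is not the actual q19 text or the exact rewrite the optimizer produces.

-- Before: the join condition p_partkey = l_partkey is repeated in every OR branch.
select count(*)
from lineitem, part
where (p_partkey = l_partkey and p_brand = 'Brand#12' and l_quantity < 11)
   or (p_partkey = l_partkey and p_brand = 'Brand#23' and l_quantity < 21);

-- After extracting the common condition, the join predicate appears once,
-- in the form a and (b or c), where it can drive the join directly.
select count(*)
from lineitem, part
where p_partkey = l_partkey
  and (   (p_brand = 'Brand#12' and l_quantity < 11)
       or (p_brand = 'Brand#23' and l_quantity < 21));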
Load Performance

Before you can experience the benefits of improved query speed, you have to get data into your database. Load performance is historically a strength of SingleStore, and we've made it even better in 6.5.

Speedup of CSV File Parsing

LOAD DATA commands have been made substantially faster. For wide tables with long comma-separated value (CSV) input lines, using strings enclosed with quotes or other characters, a speedup of more than 2X is possible. We micro-optimized parsing. The most important change applies only to LOAD DATA with a FIELDS [OPTIONALLY] ENCLOSED BY clause and is always on. It's expected to be most beneficial for tables with many columns and relatively long lines. It only affects work done on the aggregator during the LOAD, and it doesn't apply to pipelines.

Loading 6GB of data – with an average CSV line length of 1KB – into a 170-column table on an AWS cluster with an m4.10xlarge aggregator, four m4.4xlarge leaves, and 32 partitions total, we saw:

MySQL [pm]> LOAD DATA INFILE '/home/admin/portable/LOAN_KEY.csv' INTO TABLE LOAN_KEY FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';
Query OK, 5029152 rows affected (31.14 sec)

Before the change, the same load took about twice as long:

MySQL [pm]> LOAD DATA INFILE '/home/admin/portable/LOAN_KEY.csv' INTO TABLE LOAN_KEY FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';
Query OK, 5029152 rows affected (1 min 9.76 sec)

Pipelines from S3 Speedup

The time for the CREATE PIPELINE statement to return, when used with Amazon S3 cloud storage, has been reduced dramatically for S3 prefixes with very large numbers of files under them. For example, with a million files under a prefix, it previously took about 5 minutes; now it typically takes under 2 seconds. In addition, when a new file is added under an S3 path that already holds a large number of files, it will be loaded and become visible in the SingleStore database much more quickly than before, typically in seconds rather than minutes.

Previously, the time taken to create an S3 pipeline grew with the number of objects in the source S3 bucket (more precisely, the number of objects matching the specified prefix) and was unbounded. For example, CREATE PIPELINE took about 8 minutes on a bucket with 1.6 million objects. The root cause was the long time it can take to list all the files under an S3 bucket. In 6.5 we fixed this issue, and the number of network round trips now equals (number of SingleStore partitions) / 1,000, rounded up. Usually this translates to no more than 2 seconds.

The second use case improved in 6.5 follows the "continuous loading" pattern, where new files are added to the bucket and we expect to be able to query the data in SingleStore as soon as possible after the files are uploaded. Previously, this latency grew linearly with the number of existing files in the bucket. For example, suppose there are 200,000 files and a new file is uploaded: the rows from that file would not start to be loaded for about 20 seconds. With 2 million files, the latency would be over 3 minutes. Since S3 is often used for historical data, it is not uncommon for buckets to grow to the point where real-time analytics becomes infeasible. (Note that overall bandwidth is not affected, so one-time loads from S3 using SingleStore pipelines prior to 6.5 were not problematic.) In 6.5, this latency is not affected by the number of files in the bucket, provided that newly added keys are greater in sort order.
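For reference, creating and starting an S3 pipeline looks roughly like the following sketch; the pipeline name, bucket, region, credentials, and target table are all placeholders, and the full option list is in the SingleStore documentation.

-- Minimal sketch of an S3 pipeline; every name and credential below is a placeholder.
CREATE PIPELINE clicks_pipeline AS
  LOAD DATA S3 'my-bucket/clickstream/'
  CONFIG '{"region": "us-east-1"}'
  CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
  INTO TABLE clicks
  FIELDS TERMINATED BY ',';

START PIPELINE clicks_pipeline;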
Best practices: to take best advantage of this improvement, we recommend adding keys to your S3 bucket in ascending sort order. A random order of names in a large set of files can still cause a long delay before the contents of a new file become visible in the database. For example, if you name files with a string of the form "prefix_name_<timestamp>", where <timestamp> is of the form YYYY-MM-DD-HH:MM:SS.FFF, the file names will naturally be in sort order, so you'll benefit from this new improvement.

Columnstore Compression Performance

Columnstore compression performance has been improved, particularly for dictionary-compressed column segments with under 2^16 distinct values. Overall, LOAD and heavy INSERT workloads for low- to medium-cardinality strings are about 2X faster. This enhancement also improves the speed of the background merger and flusher.

Sort Speed Performance

The speed of the internal sort routine was improved. This is related to the internal memory allocator improvements mentioned earlier. It speeds up the background merger used by the columnstore, and any other internal system processes that sort data.

Conclusion

SingleStore has fundamental architectural advantages for performance compared with legacy SQL database systems, including scale out, compilation, in-memory structures, vectorization, and use of SIMD. In the 6.5 release, we've built on these architectural pillars to deliver tremendous performance gains. Some queries have sped up over 60X, and some entire workloads have more than doubled in speed, with no application changes required.

SingleStore is a database on the move. We have a talented engineering team that is keeping up a tremendous pace of improvement from release to release. Given our fundamental advantages, run-anywhere flexibility, and great performance that is only getting better, SingleStore is a great train to get aboard with your demanding data management applications.
Read Post
Shattering the Trillion-Rows-Per-Second Barrier With SingleStore
Product

Shattering the Trillion-Rows-Per-Second Barrier With SingleStore

Last week at the Strata Data Conference in San Jose, I had the privilege of demonstrating SingleStore processing over a trillion rows per second on the latest Intel Skylake servers.

It's well known that when response times drop to about a quarter of a second or less, results feel instantaneous, and that level of interactivity is deeply satisfying to users. But with large data sets and high concurrency, giving all customers that level of speed can seem beyond reach. So developers sometimes take shortcuts, such as precomputing summary aggregates. That can lead to a rigid user experience: tweak your query a little, for example by adding an extra grouping column, and suddenly it runs orders of magnitude slower. It also means your answers are not real time, i.e., not on the latest data. SingleStore gives you the interactive response time your users want, on huge data sets, with concurrent access, without resorting to precomputed results.

Running at a Trillion Rows Per Second

SingleStoreDB Self-Managed 6, which shipped in late 2017, contains new technology for executing single-table group-by/aggregate queries on columnstore data incredibly fast. The implementation is based on these methods: (1) operations done directly on encoded (compressed) data in the columnstore, (2) compilation of queries to machine code, (3) vectorized execution, and (4) use of Intel AVX2 single instruction, multiple data (SIMD) enhancements. When the group-by columns are encoded with dictionary, integer value, or run-length encoding, SingleStore runs a one-table group-by/aggregate at rates exceeding three billion rows per second per core at its peak. The fewer the groups and the simpler the aggregate functions, the faster SingleStore goes.

This incredible per-core speed gave us the idea to shoot for the trillion-rows-per-second mark. To accomplish this with a realistic query, I wrote a data generator to build a data set that simulates stock trades on the NASDAQ. Then we talked to our partners at Intel, and they generously gave us access to servers in their lab with the latest Skylake processors. These machines each have two Intel® Xeon® Platinum 8180 processors with 28 cores, for a total of 56 cores per server. I created a SingleStore cluster with one aggregator node and eight leaf nodes, with one server for each node, as shown in Figure 1. This cluster had 2 * 28 * 8 = 448 total cores on the leaves — the most important number in determining the overall rows-per-second rate we could get.
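The demo query itself isn't reproduced in this excerpt, but a single-table group-by/aggregate over simulated NASDAQ trade data of the kind described would look something like the sketch below; the table and column names are assumptions, not the actual demo schema.

-- Illustrative only: the shape of query that hits the encoded-data fast path,
-- i.e., a single-table group-by/aggregate with a small number of groups.
select stock_symbol, count(*) as trade_count
from trade
group by stock_symbol;

A query like this, grouping on a column stored with dictionary or run-length encoding, is exactly the shape the text above says can exceed three billion rows per second per core.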
Read Post
Delivering Scalable Self-Service Analytics
Data Intensity

Delivering Scalable Self-Service Analytics

Within 48 hours of launching Google Analytics as a free product, virtually all of Google's servers crashed. Eric Schmidt called this Google's "most successful disaster." Why would a free product, whose hardware requirements melted down a datacenter, be worth it? Wesley Chan, the creator of Google Analytics, later said that "Google Analytics generates about three billion dollars in extra revenue," as noted in Steven Levy's book, In The Plex. Google Analytics allowed Google's customers to measure how well AdWords actually worked, and showed them, with their own data, exactly how high they could increase bids and still make money. As Chan said, "know more, spend more."

The full potential of such an offering comes when customers are allowed to arbitrarily segment their data and calculate flexible aggregates. To do that, they need to be able to query raw, unaggregated data. This way your company does not have to guess what the customer wants to aggregate, because all choices remain available. If raw data access is not provided, then data must be precomputed, at least on some dimensions, which limits flexibility and the depth of insight users can get from the data.

Cost and technology constraints have led most companies to build customer-facing analytics with this precompute approach, because they need to serve analytics to many customers concurrently. The scale required to offer raw data access remained untenable: it was unthinkable to perform computations over billions of raw data points per request, concurrently, for thousands of customers. Today, SingleStore is changing that conventional wisdom, offering companies the ability to serve raw, unaggregated data to a range of customers.

To explain this capability further, there are three major pieces of technology in this use case:

- Scale-out
- Columnstore query execution
- Efficient data isolation

Scale-Out

Knowing system performance characteristics on a per-core basis, users can calculate how much compute and storage is needed to serve analytics at scale. Once that calculation is done, the key is to use a distributed system that provides enough dedicated compute power to meet demand. SingleStore can run on one to hundreds of nodes, which lets users scale performance appropriately.

For example, if you have a million customers with one million data points each, you have one trillion data points. Imagine that at the peak, one thousand of those customers are looking at their dashboards simultaneously – essentially firing one thousand concurrent queries at the database. Columnstore compression can store this trillion rows on a relatively small SingleStore cluster of approximately 20 nodes. Conservatively, SingleStore can scan 100 million rows per second per core, which means that just one core can service 100 concurrent queries scanning one million rows each and still deliver sub-second results for analytical queries over raw data (this arithmetic is spelled out in the short illustrative query below, just before the reproduction steps). The next section provides a benchmark of columnstore query execution performance.

Columnstore Query Execution

A simple query over a columnstore table, such as a GROUP BY, can run at a rate of hundreds of millions to over a billion data points per second per core. To demonstrate this, we loaded a public dataset covering every airline flight in the United States from 1987 until 2015. Because the goal was to understand performance per core, we loaded this into a single-node SingleStore cluster running on a 4-core, 8-thread Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz.
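First, the back-of-envelope arithmetic from the scale-out example above, written out as a small illustrative query; the figures are the ones quoted in the text, not new measurements.

-- Illustrative sizing arithmetic only; not a benchmark.
select
  1000000 * 1000000         as total_data_points,            -- 1M customers x 1M points each
  100 * 1000000             as rows_per_sec_per_core,        -- conservative per-core scan rate
  1000000 / (100 * 1000000) as seconds_to_scan_one_customer, -- 0.01 s per customer's 1M rows
  100                       as concurrent_subsecond_queries_per_core;  -- 100 x 0.01 s fits in one second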
To repeat this experiment, download the data using the following bash script:

mkdir csv
for s in `seq 1987 2015`
do
  for m in `seq 1 12`
  do
    wget http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_${s}_${m}.zip
  done
done

Create this table:

CREATE TABLE ontime (
  Year INT, Quarter INT, Month INT, DayofMonth INT, DayOfWeek INT, FlightDate Date,
  UniqueCarrier Varchar(100), AirlineID INT, Carrier Varchar(100), TailNum Varchar(100), FlightNum Varchar(100),
  OriginAirportID INT, OriginAirportSeqID INT, OriginCityMarketID INT, Origin Varchar(100),
  OriginCityName Varchar(100), OriginState Varchar(100), OriginStateFips Varchar(100), OriginStateName Varchar(100), OriginWac INT,
  DestAirportID INT, DestAirportSeqID INT, DestCityMarketID INT, Dest Varchar(100),
  DestCityName Varchar(100), DestState Varchar(100), DestStateFips Varchar(100), DestStateName Varchar(100), DestWac INT,
  CRSDepTime INT, DepTime INT, DepDelay INT, DepDelayMinutes INT, DepDel15 INT,
  DepartureDelayGroups Varchar(100), DepTimeBlk Varchar(100), TaxiOut INT, WheelsOff INT, WheelsOn INT, TaxiIn INT,
  CRSArrTime INT, ArrTime INT, ArrDelay INT, ArrDelayMinutes INT, ArrDel15 INT, ArrivalDelayGroups INT, ArrTimeBlk Varchar(100),
  Cancelled INT, CancellationCode Varchar(100), Diverted INT,
  CRSElapsedTime INT, ActualElapsedTime INT, AirTime INT, Flights INT, Distance INT, DistanceGroup INT,
  CarrierDelay INT, WeatherDelay INT, NASDelay INT, SecurityDelay INT, LateAircraftDelay INT,
  FirstDepTime Varchar(100), TotalAddGTime Varchar(100), LongestAddGTime Varchar(100),
  DivAirportLandings Varchar(100), DivReachedDest Varchar(100), DivActualElapsedTime Varchar(100), DivArrDelay Varchar(100), DivDistance Varchar(100),
  Div1Airport Varchar(100), Div1AirportID INT, Div1AirportSeqID INT, Div1WheelsOn Varchar(100), Div1TotalGTime Varchar(100), Div1LongestGTime Varchar(100), Div1WheelsOff Varchar(100), Div1TailNum Varchar(100),
  Div2Airport Varchar(100), Div2AirportID INT, Div2AirportSeqID INT, Div2WheelsOn Varchar(100), Div2TotalGTime Varchar(100), Div2LongestGTime Varchar(100), Div2WheelsOff Varchar(100), Div2TailNum Varchar(100),
  Div3Airport Varchar(100), Div3AirportID INT, Div3AirportSeqID INT, Div3WheelsOn Varchar(100), Div3TotalGTime Varchar(100), Div3LongestGTime Varchar(100), Div3WheelsOff Varchar(100), Div3TailNum Varchar(100),
  Div4Airport Varchar(100), Div4AirportID INT, Div4AirportSeqID INT, Div4WheelsOn Varchar(100), Div4TotalGTime Varchar(100), Div4LongestGTime Varchar(100), Div4WheelsOff Varchar(100), Div4TailNum Varchar(100),
  Div5Airport Varchar(100), Div5AirportID INT, Div5AirportSeqID INT, Div5WheelsOn Varchar(100), Div5TotalGTime Varchar(100), Div5LongestGTime Varchar(100), Div5WheelsOff Varchar(100), Div5TailNum Varchar(100),
  key (AirlineID) using clustered columnstore
);

Then load the data into the table:

load data infile '/home/memsql/csv/*' into table ontime fields terminated by ',' enclosed by '"' lines terminated by ',\n' ignore 1 lines;

Once the data is loaded, run a simple GROUP BY command. The following query performs a full table scan:

SELECT OriginCityName, count(*) AS flights
FROM ontime
GROUP BY OriginCityName
ORDER BY flights DESC
LIMIT 20;

On a machine with 4 cores, this query over the 164-million-row dataset runs in 0.04 seconds, which works out to 1 billion rows per second per core. No, that's not a typo. That's a billion rows per second per core.
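A somewhat more involved aggregation over the same table might look like the following; this query is illustrative only and is not part of the original benchmark.

-- Illustrative only: an aggregation with a filter, an expression, and two grouping columns.
SELECT Year, Carrier, AVG(ArrDelayMinutes) AS avg_arrival_delay
FROM ontime
WHERE Cancelled = 0
GROUP BY Year, Carrier
ORDER BY avg_arrival_delay DESC
LIMIT 20;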
More complex queries will consume more CPU cycles, but with this level of baseline performance there is a lot of room, across a cluster of 8, 16, or even hundreds of machines, to handle multi-billion-row datasets with response times under a quarter of a second. At that speed, queries appear instantaneous to users, leading to great user satisfaction. Try this example using SingleStoreDB Self-Managed 6. New vectorized query execution techniques in SingleStoreDB Self-Managed 6, using SIMD and operations directly on encoded (compressed) data, make this speed possible.

Efficient Data Isolation Per Customer

Data warehouses such as Redshift and BigQuery support large scale, but may not sufficiently isolate different queries in highly concurrent workloads. On top of that, both have substantial fixed overhead on a per-query basis. Redshift in particular does not support many concurrent queries: http://docs.aws.amazon.com/redshift/latest/dg/cm-c-defining-query-queues.html.

Depending on the analytical requirements, SingleStore allows for an ordered and partitioned physical data layout that ensures only data belonging to a single customer is scanned. In our example, the columnstore was clustered on AirlineID. SingleStore supports clustered columnstore keys, which allow global sorting of columnstore tables. In this case, a query with a predicate on AirlineID will scan only the subset of data belonging to that airline. This allows SingleStore to deliver very high concurrency (in the thousands of concurrent queries), with each query scanning and aggregating millions of data points.

More on Query Execution

At SingleStore, we are continuously innovating with new query processing capabilities. This is a list of recent innovations in our shipping product: https://archived.docs.singlestore.com/v6.0/release-notes/memsql/60-release-notes/.

Bringing It All Together

Going back to our original example: though our dataset is one trillion rows, because of the clustered columnstore key each customer only needs to scan through one million rows. For a simple query like the one above, scanning 500 million rows per second per core means that a single CPU core could support 500 concurrent queries and still deliver sub-second performance. To recreate the work mentioned in this blog, try out SingleStoreDB Self-Managed 6: singlestore.com/free.
Read Post
Arrays – A Hidden Gem in SingleStore
Engineering

Arrays – A Hidden Gem in SingleStore

Released this March, SingleStoreDB Self-Managed 6 Beta 1 introduced SingleStore Procedural SQL (MPSQL). MPSQL supports the creation of:

- User-Defined Functions (UDFs)
- Stored Procedures (SPs)
- Table-Valued Functions (TVFs)
- User-Defined Aggregate Functions (UDAFs)

A Hidden Gem: Array Types

There's a hidden gem in SingleStoreDB Self-Managed 6 Beta 1 that we didn't document at first — array types! These make programming much more convenient. Since we compile your extensions to machine code, the performance is fantastic. And you don't have to leave the comfort of your database language to get it.

To declare and initialize a new array of 10 integers, you simply say:

declare a array(int) = create_array(10);

You might be wondering: how does SingleStore know what kind of array to create when you call the create_array() function? The secret is type inference. Our compiler knows the data type of the variable you're assigning the result of create_array() to, and it uses that to determine the data type of the array. Our arrays are zero-based, meaning the positions of the array of 10 integers above are numbered 0..9.

Real Examples

Let's try some real examples. Here's a function to create an array of size n containing a pseudo-random sequence of numbers:

delimiter //
create function random_array(n bigint) returns array(bigint not null) as
declare
  result array(bigint not null) = create_array(n);
  a bigint not null = 573829473;
  b bigint not null = 837562837;
  m bigint not null = 100000000;
  crnt bigint not null = 17;
begin
  for i in 0 .. (n - 1) loop
    crnt *= a;
    crnt += b;
    crnt %= m;
    result[i] = crnt;
  end loop;
  return result;
end //
delimiter ;

Suppose you want to display the array — you'll need to convert it to text:

delimiter //
create function to_text(a array(bigint not null)) returns longtext as
declare
  result longtext = "[";
  comma bool = false;
begin
  for i in 0 .. (length(a) - 1) loop
    if comma then
      result = concat(result, ", ");
    end if;
    comma = true;
    result = concat(result, a[i]);
  end loop;
  return concat(result, "]");
end //
delimiter ;

Let's put it all together:

memsql> select to_text(random_array(5));
+-----------------------------------------------------+
| to_text(random_array(5))                            |
+-----------------------------------------------------+
| [92663878, 16439131, 15870800, 37651237, 23070938]  |
+-----------------------------------------------------+
1 row in set (0.37 sec)

SingleStore dynamically compiles the code in functions, so the first time (or sometimes the first few times) you run them, they can take a little longer due to compilation. Hence the 0.37 sec time above. Let's see what happens if we run it again:

memsql> select to_text(random_array(5));
+-----------------------------------------------------+
| to_text(random_array(5))                            |
+-----------------------------------------------------+
| [92663878, 16439131, 15870800, 37651237, 23070938]  |
+-----------------------------------------------------+
1 row in set (0.00 sec)

So, after the function is compiled, it runs in milliseconds! Speaking of performance, in our tests a merge-sort UDF written in MPSQL can sort an array of 200,000 numbers in 0.2 seconds. Try doing that in another database language!
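In the same style, here is one more small UDF, not from the original post, that sums the elements of an array; it is a sketch that reuses only the constructs shown above, so treat the exact syntax details as approximate.

delimiter //
create function array_sum(a array(bigint not null)) returns bigint as
declare
  s bigint = 0;
begin
  -- Walk the array with the same zero-based loop used in to_text() above.
  for i in 0 .. (length(a) - 1) loop
    s += a[i];
  end loop;
  return s;
end //
delimiter ;

-- For example, sum a 5-element pseudo-random array:
select array_sum(random_array(5));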
Multidimensional Arrays

You can declare multidimensional arrays too, like so:

declare b array(array(int));

Here's an example function that creates a two-dimensional array and sums the contents of all its elements:

delimiter //
create or replace function md_array_sum() returns int as
declare
  b array(array(int));
  s int = 0;
begin
  b = [[0, 100], [100, 0]];
  for i in 0 .. length(b) - 1 loop
    for j in 0 .. length(b[i]) - 1 loop
      s += b[i][j];
    end loop;
  end loop;
  return s;
end //
delimiter ;

Notice that it creates an array literal [[0, 100], [100, 0]] and assigns it to b. This is another way to make an array, as an alternative to create_array(). Our multidimensional arrays are called "ragged arrays" because individual sub-arrays can contain different numbers of elements, although we didn't do that in this example.

Let's take it for a spin:

memsql> select md_array_sum();
+----------------+
| md_array_sum() |
+----------------+
|            200 |
+----------------+
1 row in set (0.00 sec)

Try SingleStoreDB Self-Managed 6 Beta

Arrays are an amazing addition to SingleStore. You can learn more about arrays and SingleStore Procedural SQL in our online documentation. Download and try SingleStoreDB Self-Managed 6 Beta 1 today. Enjoy!
Read Post