Multi-tenancy in SingleStoreDB

VR

Vinicius Ruoso

CTO at Leads2B & SingleStore Ambassador

Multi-tenancy in SingleStoreDB

Multi-tenancy — a very early decision for any company or piece of software that aims to serve multiple customers at the same time, each with their own data — is hard. Multi-tenancy is common in SaaS companies, and if you don’t have time to stop and really think about this decision at the beginning of your project, you will have to live with that for some time — and migrating from one strategy to another might not be as simple as you’d expect.

Let’s dive into some of the models that commonly use multi-tenancy. Keep in mind that the ideas and examples here are geared toward relational databases, but the same applies to most kinds of datastores available.

taking-a-closer-look-at-multi-tenancy-modelsTaking a closer look at multi-tenancy models

One schema: multi-tenancy by tenant column

In this model, you have one master schema and one version of each of your tables. All customer data is stored in the same set of tables, independent of the customer. To distinguish data between customers, every table has a tenant ID column (or whatever you prefer to call it).

This means that for every query you develop in your application, you want to ensure it has that tenant ID column filtered to the correct customer issuing the request. There are a lot of ways to ensure this accuracy, but it’s very dependent on the language and frameworks you are using to access the data. As a rule of thumb, you’ll want to write several tests around that logic to ensure that even when a developer makes a mistake, data from one customer will not show up when another customer is using the app (since this can be really harmful to your company’s reputation).

Apart from the app logic to make sure you are querying the correct data, there are performance and maintenance considerations as well. If all data is within one table that will be served by one database server (in most cases), you can run into a few problems:

  1. There can be a “noisy neighbor”— one enterprise customer that imports a lot of data and has a lot of users, where the database resources are not properly divided and customers can be compromised.
  2. Depending on the database you’re using, you may experience “locking.” When running transactions or DDL queries, you risk locking the table for minutes. It’s also not trivial to dedicate database resources to a certain customer in that scenario.

When using SingleStoreDB, multi-tenancy by tenant column is a very good approach. Keep in mind that as the database is distributed, you want to think about a good shard key so data is evenly distributed across the cluster nodes. This can be achieved by using the tenant ID with some extra columns. Using only the tenant ID is not a good idea, since queries from one tenant will be concentrated on a small number of nodes.

Blog post: Implementing a Scalable Multi-tenant Service

Tables can be very big, but Columnstore allows you to query that data super fast. By using clever techniques, like compression, queries executed in tables with billions of rows can be very fast. That store model has been evolving steadily throughout the years and is moving toward the Universal Storage concept, where you have a table type that is capable of running analytical queries (OLAP) and high transactional (OLTP) workloads.

Changing the table schema is also something that is very easy with SingleStoreDB since all operations are lock-free. You can easily create new indexes, alter columns and more without worrying too much about your application being unavailable. Of course, you need to keep in mind the processing and memory requirements necessary to perform those operations.

One schema per tenant

In this model, each tenant has a schema with its own set of tables. It provides a higher level of isolation between tenants, but you will end up with a lot more tables. Consider the number of tables your application needs, and also the number of customers you expect to use your product.

You can serve each schema from a completely different database server for each tenant, or you can serve them from just one and distribute the load as you grow the user base. See the operations section for more on that.

Another downside of this approach is that you will end up with a high number of tables in your database server. That number can easily reach millions of tables — consider an app with 100 tables and 10,000 tenants. Most databases do not work well with such a high number of tables. For instance, MySQL will suffer because it has multiple caches that keep information about tables in memory as it has to load the table definitions used on each connection. If you find yourself in that scenario, try optimizing the table cache vs. buffer memory so you have a good balance, and the server is stable.

SingleStoreDB has a very different approach to support schemas. Since it’s designed to support data-intensive applications, there exists a notion that there won’t be a large number of schemas in the database cluster. Each schema will have a huge amount of data (much more now with unlimited storage support), and the goal is to support transactional and analytical queries on tables with literally billions of rows.

For each schema, there is memory allocated to support various buffers, caches and transaction logs, as well as pre-allocated disk space for each partition. This can easily and quickly fill your cluster disk if you create hundreds of databases in a cluster with default settings. You get 256MB per partition, per database only for transaction logs, and it is recommended to have one partition per CPU for better performance. That means for a 32 CPU cluster you need roughly 8GB per database, just to sit there without any data.

That becomes impractical if you need several databases — so unless you know you will have very few tenants and can live with the overhead, this is not a good approach for SingleStoreDB. There are some system variables that control the size of those files, so you can reduce them in any case.

Blog post: Database Multi-tenancy in the Cloud and Beyond

One schema per tenant (via tables)

Another alternative to the previous approach is this: Instead of having multiple schemas, you can have one (or fewer) schema and separate customer data into different tables. This will leave you with a few schemas that have a large number of tables. It also means you have almost the same data isolation, without many of the same downsides you would get with locks on DDL operations. But it’s also important to keep in mind that customers in the same schema are also sharing the same resources, so the “noisy neighbor” problem still exists.

This approach works better with SingleStoreDB since tables have a much smaller overhead than databases — but don’t expect to have millions of tables. The table implementation offers extremely similar isolation to what you would get using databases. And since each table is unique, queries to multiple tenants will have to be compiled for each one, even when the query is exactly the same except by using different tables. This also has some effect on other databases, as the engine needs to somehow figure out the table schema to run the query — and having multiple tables can mess up the caches.

hybridHybrid

All the models previously mentioned can be mixed together to help you achieve your goals. For instance, you can have any number of servers, with each server having a set of tables for tenants in the “one table per tenant” schema, multiple servers with the tenant column in all tables, and so on. You get the idea. That can be a solution for a problem you have in your current setup — like adding extra servers. In some cases, you can adapt your application to the hybrid model very easily. And in others, it can be difficult depending on your stack and the number of services that need to be changed.

With SingleStoreDB, it is not very cost-effective to have separate clusters for the same application since each cluster has at least one master and a few aggregators. To scale the cluster, you can add more and more leaf nodes adding CPU and RAM to the problem, simplifying the operation and application code as you do not need to handle the logic of figuring out where your customer data is located before accessing it. You won’t be able to escape that if you need to serve customers globally, so you can be prepared if you want to reach that goal.

how-single-store-db-features-support-a-multi-tenancy-approachHow SingleStoreDB features support a multi-tenancy approach

There are a few key SingleStoreDB features that can help you make a decision on your multi-tenant strategy.

SingleStoreDB’s bottomless architecture enables you to copy data to an object store (S3 compatible) asynchronously. This means that the data loaded in the cluster nodes can be totally copied to S3 when you don't need to use that database anymore, and lazy-loaded when you need it again. The feature to attach/detach a database can be helpful in a lot of scenarios, depending on your specific use case. Of course, there is a latency on those commands because you need time to get data from the object store and load it in memory, and that will depend on the tables used and the data size.

Other noteworthy features of SingleStoreDB that help with the “noisy neighbor” problem are the workload management and resource pool features. Workload management automatically helps your cluster perform better when there are spikes in concurrency by prioritizing cheap queries. The resource pool feature allows you to create a pool of resources by specifying CPU, memory and concurrency of queries in that pool — limiting the effect those queries can have on the entire cluster.

Finally, row-level security at the database level can help mitigate any security concerns with the tenant column approach.

operationsOperations

Some of these multi-tenancy models make day-to-day operations as simple as running one query, or as complicated as lots of tooling around to manage your tenants. Let’s dive into four real-life scenarios where the choice you make on day one can impact your operations later down the road.

  1. Onboarding new customers: What are the steps you need to perform to get a new customer or company to use your service? If you want to provide a free tier and allow new users to subscribe, you should prepare for that — and the database model you choose will impact the time it takes you to do so. The tenant ID column approach is the easiest here; you don’t have to do anything special, it is only a new customer with a new tenant ID, and life goes on. If you have multiple servers and tenants distributed among them, you need a catalog to map tenants to database servers, and some sort of logic to decide where a new customer should be placed.

    The other approaches all involve some sort of replication logic as well. If you have one schema per customer, when a new customer wants to use your product, you need to create that schema and all associated tables (using the best tools available in your database). This can take some time depending on the database and number of tables, so you may need a buffer of unused schemas on standby. You also need a base schema to copy the structure from, usually without any data on it. Again, you need a catalog to determine which schema belongs to which customer.

  2. Schema updates: Usually developers will have database migrations to control database schema changes like adding columns, tables, indexes, etc. This is easy to execute if you have one schema, but remember that depending on the database engine you are using, some DDL operations might lock your tables for some time depending on the data size. Luckily for SingleStoreDB, this is not the case as operations are lock-free.

    If you have a lot of database servers and schemas to run your migration script, this can start to become a complicated process. You need to make sure that all the schemas are in sync with each other, or your application will eventually fail for some of your customers. You also need to make sure that the application code can handle the schema change gracefully while you are running the migration, because some schemas will be migrated while some will not. And finally, if you are onboarding new customers while the schema is changing, make sure you are running the script first in the base schema (the one that is copied for each new customer).

  3. Customer Analytics: As a SaaS company, you need to analyze your customers’ behavior and product usage. This drives a lot of decisions on how to improve the product for customer engagement and retention. As you can imagine, this will vary a lot based on the multi-tenancy model being used.

    If all data is in the same location, simple queries will allow you to analyze all customers at the same time, grouping by the tenant ID. In cases where data is separated in multiple servers, schemas and tables, you basically need to analyze each customer separately and consolidate that somewhere — something that requires a lot more work. It also takes more time to run 10,000 queries than one query (independent of its complexity), so it is a lot harder to get a snapshot of your customer base at any one point in time.

  4. Data localization and migration: If the application is used by customers around the world, you probably will end up with multiple servers located close to them. As you grow and add more locations, data migration can be necessary to minimize latency. And in some models, you may need to do migrations based on the performance needs of the customers — like migrating a large enterprise customer consuming all resources on the current server to a new one with more resources, or fewer customers.

    Migrating tenant data from one location to another usually incurs some downtime for that particular customer while you transfer the data. If you allow transactions to be applied in the origin server while the migration is in progress, you will likely end up with an inconsistent copy of the data on the other side. I would say that this is one of the biggest pain points companies with a large, distributed customer base will have in their data tier as getting all details of that orchestration is hard. The solution? Try to automate as much of the process as possible with as many fail-safes in place to prevent data loss.

With robust functionality and distributed, relational model, SingleStoreDB delivers what most multi-tenancy approaches require: Reliable, elastic scalability, high query ingest and concurrency and the ability to deliver real-time analytics on constantly changing data.

Try it out for free

Get started on SingleStoreDB

___________________________________

References:


Share