Referential Integrity Checks in SingleStore

TJ Gibson

Commercial Solutions Engineer

SingleStore is a real-time data platform with powerful, scalable operational and transactional capabilities for intelligent applications and analytics. SingleStore’s patented Universal Storage engine provides real-time ingestion with ultra-low-latency, high-concurrency transactional, analytical, vector, semantic or hybrid queries against memory-optimized columnstore or rowstore tables.

This gives SingleStore substantial advantages in use cases where data needs to be transacted, analyzed or contextualized in near real time. We call this hybrid transactional-analytical processing (HTAP).

Some of the most common questions about transactional and HTAP use cases, or even migrations from MySQL, center on referential integrity: how do you support foreign key relationships across a sharded system where foreign keys aren’t common? This blog answers that question with a simple catalog, subscriber and subscription example, demonstrating how to uphold foreign key relationships in high-performance distributed SQL, via application inserts and pipelines, without defining foreign keys in DDL.

For a detailed overview of how to handle multiple unique keys in SingleStore with a similar approach, check out this blog by SingleStore and Laravel community expert Jack Ellis. For a detailed overview of how SingleStore is optimized for transactional workloads, see “Build Faster Transactional Apps With SingleStore: Part I and Part II.”

Catalogs, subscribers and subscriptions case

Imagine an application that drives delivery of subscribed content via email. Managing this mail list might reasonably entail:

  • Ensuring all inserted catalogs belong to an existing organization
  • Ensuring all subscribers belong to an org that does not yet contain the subscriber’s email
  • Ensuring all subscriptions belong to a distinct subscriber and catalog

This blog shows how SingleStore can easily enforce these relationships with application logic, without foreign keys.

Initial design considerations

At the core, referential integrity is the process of managing relationships between tables in a database. In most relational databases like MySQL or PostgreSQL, foreign keys and unique keys are an accepted performance hit to guarantee bad inserts do not commit. First and foremost, it is crucial to call out that referential integrity typically is, and always should be, enforced at the application layer. No database key or process should solely enforce these relationships — it is the application's job to ensure the consistency of the ACID transaction written. Application logic should:

  • Enforce uniqueness of required fields via their datatype, and save the primary key for unique update/upsert records
  • Ensure inserts contain the required relational key values
  • Encode the uniqueness or referential check logic in the insert statement or pipeline procedure

With proper application-side logic in place, foreign key relationships and uniqueness have already been determined; the keys are there as insurance to enforce this relational integrity across tables and make sure no bad data is inserted. This is particularly true in microservices-based architectures where eventual consistency is an inherent design principle. Yet we lean on OLTP-only solutions to enforce these keys as insert volume increases, and accept the performance hit.

SingleStore is designed to minimize latency and maximize insert volume across a sharded three-tier storage model. SingleStore is MySQL wire protocol compliant, but it is not a MySQL build. There are many features in SingleStore not in MySQL, and a few in MySQL that are not supported in SingleStore (outlined here). Notably for this blog’s purpose: 

  • SingleStore supports only one enforced primary or unique key
  • SingleStore does not currently support foreign keys

Managing foreign keys or multiple unique keys adds operational overhead that is typically not worthwhile compared to adding checks into the insert itself, especially when looking to dramatically speed up transactions, as outlined in the Fathom blog’s multiple unique column example. Outside of application-level logic and keys in a table, unenforced unique constraints can help flag suspected violations that slip through the logic.
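
As a minimal sketch (assuming a SingleStore version that supports unenforced unique constraints; the table and columns here are illustrative, not part of this blog’s schema):

-- The UNENFORCED unique key documents intent without blocking or slowing
-- writes; suspected violations can be audited later with a GROUP BY query.
CREATE TABLE subscribers_sketch (
`id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
`org_id` INT UNSIGNED,
`email` VARCHAR(255),
PRIMARY KEY (`id`),
SHARD KEY (`id`),
UNIQUE KEY (`email`, `org_id`) UNENFORCED
);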

In instances where there are multiple unique key requirements and uniqueness of a field is guaranteed by datatype (like a hash value or auto-incremented value), consider removing or un-enforcing the unique key on these values, and assign the primary key based on update and upsert requirements. Other examples describe rapidly inserting into one table without uniqueness constraints, then using conditional selects to insert into a secondary table that is used for queries.
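
A minimal sketch of that two-table pattern, with hypothetical table and column names:

-- Raw rows land in events_staging with no uniqueness constraints; a
-- conditional INSERT ... SELECT then copies only new keys into the
-- deduplicated table that queries actually read.
INSERT INTO events_clean (event_id, payload)
SELECT s.event_id, ANY_VALUE(s.payload)
FROM events_staging s
WHERE NOT EXISTS (SELECT 1 FROM events_clean c WHERE c.event_id = s.event_id)
GROUP BY s.event_id;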

Since those uniqueness checks already have more published examples, this blog focuses primarily on foreign key checks. SingleStore’s memory-optimized query engine ensures these filter-based INSERT … SELECT statements uphold relationships while inserting data in near real time.

Step 1: Loading Orgs and Catalogs with conditional inserts

The first foreign key relationship we need to uphold is between orgs and catalogs: write transactions should ensure that any insert into catalogs first checks that its org_id exists as an id in orgs. Figure 1 illustrates this relationship with an example of this insert sent from the application as a nested transaction.

The first step is to create the orgs table and load 10 initial records:

-- Orgs
DROP TABLE IF EXISTS orgs;
CREATE TABLE `orgs`
( `id` bigint(10) unsigned NOT NULL AUTO_INCREMENT,
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
`name` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`send_cost` decimal(4,2) NOT NULL,
`active` tinyint(4) NOT NULL,
PRIMARY KEY (`id`) USING HASH,
SHARD KEY `__SHARDKEY` (`id`),
SORT KEY `__UNORDERED` ()
) ;
-- Insert 10 records manually into the orgs table
INSERT INTO orgs (id, created_at, updated_at, name, send_cost, active)
VALUES
(1, '2023-10-01 00:00:00', '2023-10-01 00:00:00', 'PearStreet', 1.00, 1),
(2, '2023-10-15 00:00:00', '2023-10-20 00:00:00', 'YourFavCompnay', 2.00, 1),
(3, '2023-11-01 00:00:00', '2023-11-02 00:00:00', 'GetOutside', 3.00, 1),
(4, '2023-11-15 00:00:00', '2023-11-19 00:00:00', 'Partner, Partner, and Partner', 4.00, 1),
(5, '2023-12-01 00:00:00', '2023-12-01 00:00:00', 'Avenue5', 5.00, 1),
(6, '2023-12-15 00:00:00', '2023-12-20 00:00:00', 'Big Inc', 6.00, 1),
(7, '2024-01-01 00:00:00', '2024-01-02 00:00:00', 'Small Co', 7.00, 1),
(8, '2024-01-15 00:00:00', '2024-01-20 00:00:00', 'Acme Company', 8.00, 1),
(9, '2024-02-01 00:00:00', '2024-02-05 00:00:00', 'Wayne Enterprises', 9.00, 1),
(10, '2024-02-15 00:00:00', '2024-02-20 00:00:00', 'The Database Beat', 10.00, 1);

Next, create the Catalogs table:

DROP TABLE IF EXISTS catalogs;
CREATE TABLE `catalogs` (
`id` bigint(10) unsigned NOT NULL AUTO_INCREMENT,
`org_id` int(10) unsigned DEFAULT NULL,
`parent_id` int(11) unsigned DEFAULT NULL,
`name` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`active` tinyint(4) NOT NULL DEFAULT '1',
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
SORT KEY `__UNORDERED` (),
SHARD KEY (id),
PRIMARY KEY (id) USING HASH
) ;

Let’s now mimic the application transaction in Figure 1, attempting to insert a new one-off row into catalogs via a database command, where the org_id (9999) does not exist in orgs.id. This inserts no records, because no orgs.id equals 9999.

INSERT INTO catalogs (id, org_id, name, active, created_at, updated_at)
SELECT 9999, 9999, 'Villains Inc', 0, '2023-12-07 00:27:50', '2024-02-18 04:10:28'
FROM dual
WHERE EXISTS (SELECT 1 FROM orgs o WHERE o.id = 9999);
SELECT * FROM catalogs;

Inserts following the preceding pattern (or Figure 1) can be batched into concurrent multi-insert statements from the application of choice, or batched in stored procedures as single INSERT … SELECT statements.
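
As a hedged sketch of such a batch, several candidate rows (the values below are illustrative) can be filtered against orgs in one statement; only rows whose org_id exists survive the SELECT:

INSERT INTO catalogs (org_id, name, active, created_at, updated_at)
SELECT t.org_id, t.name, t.active, NOW(), NOW()
FROM (
SELECT 1 AS org_id, 'Spring Catalog' AS name, 1 AS active
UNION ALL
SELECT 9999, 'Orphan Catalog', 1
) t
-- Rows with a nonexistent org_id (9999 here) are silently dropped
WHERE EXISTS (SELECT 1 FROM orgs o WHERE o.id = t.org_id);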

Perhaps the application has queues that push writes to an S3 bucket or Kafka topic as they are created. In this case, we can leverage a stored procedure to perform our referential checks before inserting records from a SingleStore pipeline into a table. The next section illustrates how to establish pipelines into stored procedures to stream data into catalogs, subscribers and subscriptions while checking for referential integrity. 

Step 2: Streaming catalogs, subscribers and subscriptions with Pipelines into procedures

SingleStore pipelines provide real-time ingest by streaming data directly into partitions in parallel. Pipelines can be created to stream data directly into a table or into a stored procedure, where it can be manipulated before insertion. This procedural approach allows for referential integrity checks as described in Figure 1 directly in the stream. All three tables will follow this procedure and pipeline format:

--- Pipeline Into Stored Procedure Relationship Check Template
DELIMITER //
CREATE OR REPLACE PROCEDURE <SP_name>(
batch QUERY(
<table_fields coming from pipeline>
)
)
AS
BEGIN
INSERT INTO <table_name> (<columns>)
SELECT <all columns to insert>
FROM batch b
WHERE EXISTS (SELECT 1 FROM <parent_table> WHERE b.<fk_field> = <parent_table>.<key_field>);
END //
DELIMITER ;
----------------
--- Pipeline Template
CREATE OR REPLACE PIPELINE <Pipeline_name>
AS LOAD DATA -- <Link or Credentials>
BATCH_INTERVAL 2500
DISABLE OFFSETS METADATA GC
INTO PROCEDURE <Procedure_name>
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' ESCAPED BY '\\'
LINES TERMINATED BY '\n' STARTING BY '' NULL DEFINED BY ''
IGNORE 1 LINES; -- if files contain headers

Catalogs

We can guarantee Catalogs belong to an existing org with the following procedure and pipeline:

-- Procedure `catalogs_insert_check`
DELIMITER //
CREATE OR REPLACE PROCEDURE catalogs_insert_check
(batch QUERY(
`id` bigint(10) unsigned NOT NULL,
org_id int(10) unsigned,
parent_id int(11) unsigned,
name varchar(255),
active tinyint(4),
created_at timestamp,
updated_at timestamp
)
)
AS
BEGIN
INSERT INTO catalogs(org_id, parent_id, name, active, created_at, updated_at)
SELECT org_id, parent_id, name, active, created_at, updated_at
FROM batch b
WHERE EXISTS (SELECT 1 FROM orgs WHERE b.org_id = orgs.id);
END //
DELIMITER ;
------------------
-- Pipeline Into Procedure `catalogs_insert_check`
CREATE PIPELINE `catalogs`
AS LOAD DATA S3 's3://<Redacted>/*'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "<REDACTED>", "aws_secret_access_key": "<REDACTED>",
"aws_session_token": "<REDACTED>"}'
BATCH_INTERVAL 2500
DISABLE OFFSETS METADATA GC
INTO PROCEDURE `catalogs_insert_check`
FIELDS TERMINATED BY ',' ENCLOSED BY '\"' ESCAPED BY '\\'
LINES TERMINATED BY '\r\n' STARTING BY '';
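
One detail the DDL above does not show: a created pipeline does not ingest until it is started. A foreground start is handy for a quick test because errors surface directly in the client:

START PIPELINE catalogs; -- continuous, background ingestion
-- Or test synchronously, one batch at a time:
-- START PIPELINE catalogs FOREGROUND LIMIT 1 BATCHES;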

Subscribers

Inserts to subscribers first check that the subscriber belongs to a valid organization, then that their email does not already belong to a subscriber in that org, as shown in Figure 2. This check handles both referential integrity and uniqueness.

The table DDL, stored procedure for checking inserts and pipeline for subscribers are outlined here:

-- DDL for Subscribers Table (fields mirror the procedure's batch definition below)
CREATE TABLE `subscribers` (
`id` bigint(10) unsigned NOT NULL AUTO_INCREMENT,
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
`org_id` int(10) DEFAULT NULL,
`upload_id` int(10) unsigned DEFAULT NULL,
`email` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`first` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`last` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`street` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`city` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`company` varchar(128) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`state` char(2) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`zip` char(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`phone` varchar(30) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`complaint_count` int(11) DEFAULT NULL,
`last_open` datetime DEFAULT NULL,
`last_click` datetime DEFAULT NULL,
`lcount` tinyint(4) DEFAULT NULL,
SORT KEY `__UNORDERED` (),
SHARD KEY (id),
PRIMARY KEY (`id`)
) ;
------------------
-- Subscribers_insert_check Procedure
DELIMITER //
CREATE OR REPLACE PROCEDURE subscribers_insert_check
(batch query(
`id` bigint(10) unsigned NULL,
`created_at` timestamp NULL,
`updated_at` timestamp NULL,
`org_id` int(10) NULL,
`upload_id` int(10) unsigned NULL,
`email` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NULL,
`first` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NULL,
`last` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NULL,
`street` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NULL,
`city` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NULL,
`company` varchar(128) CHARACTER SET utf8 COLLATE utf8_unicode_ci NULL,
`state` char(2) CHARACTER SET utf8 COLLATE utf8_unicode_ci NULL,
`zip` char(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci NULL,
`phone` varchar(30) CHARACTER SET utf8 COLLATE utf8_unicode_ci NULL,
`complaint_count` int(11) NULL,
`last_open` datetime NULL,
`last_click` datetime NULL,
`lcount` tinyint(4) NULL)
)
AS
BEGIN
INSERT INTO subscribers(id, created_at, updated_at, org_id, upload_id, email, first, last, street, city, company, state, zip, phone, complaint_count, last_open, last_click, lcount)
SELECT b.id, b.created_at, b.updated_at, b.org_id, b.upload_id, b.email, b.first, b.last, b.street, b.city, b.company, b.state, b.zip, b.phone, b.complaint_count, b.last_open, b.last_click, b.lcount
FROM batch b
WHERE EXISTS (SELECT 1 FROM orgs o WHERE b.org_id = o.id)
AND NOT EXISTS (SELECT 1 FROM subscribers s WHERE b.email = s.email AND b.org_id = s.org_id);
END //
DELIMITER ;
------------------
-- Subscribers Pipeline
CREATE or replace PIPELINE `subscribers`
AS LOAD DATA S3 's3://<Redacted>/*'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "<REDACTED>", "aws_secret_access_key": "<REDACTED>", "aws_session_token":
"<REDACTED>"}'
BATCH_INTERVAL 2500
DISABLE OFFSETS METADATA GC
INTO PROCEDURE `subscribers_insert_check`
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\r\n' STARTING BY ''
IGNORE 1 LINES
FORMAT CSV;
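
As with catalogs, the pipeline must be started (START PIPELINE subscribers;). Per-file progress can then be checked through information_schema; a sketch, as exact columns vary by version:

-- Shows which S3 files the pipeline has loaded, skipped or queued
SELECT file_name, file_state
FROM information_schema.PIPELINES_FILES
WHERE pipeline_name = 'subscribers';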

We can now validate that every inserted org_id exists as an id in orgs, and that each subscriber’s email is unique within its org, as follows:

-- make sure all inserted `subscriber`.`org_ids` exist in `orgs`.`id`
SELECT count(*)
FROM `subscribers` AS s
WHERE NOT EXISTS
(SELECT 1 FROM `orgs` AS o WHERE s.`org_id`=o.`id`);
-- Make sure each email exists 1 time in org_id after pipeline load
SELECT COUNT(*) AS duplicate_count
FROM (
SELECT email, org_id
FROM subscribers
GROUP BY email, org_id
HAVING COUNT(*) > 1
) AS dup_count;

We’ll set the stage with one last table, subscriptions, which has its own referential integrity requirements, then test the pipeline with purposefully bad batches.

Subscriptions

The subscriptions table stores information on subscriptions to different organizations' catalog items. All inserts must check that the subscription contains a valid catalog_id and a valid subscriber_id, and that the subscription to the catalog item is unique for this subscriber, as outlined in Figure 3.

The table DDL, stored procedure for checking inserts and pipeline for subscriptions are outlined here:

-- DDL for Subscriptions Table
CREATE TABLE `subscriptions` (
`id` bigint(10) unsigned NOT NULL AUTO_INCREMENT,
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
`subscriber_id` bigint(10) DEFAULT NULL,
`catalog_id` bigint(10) DEFAULT NULL,
`upload_id` int(10) DEFAULT NULL,
`group_id` int(10) DEFAULT NULL,
`tags` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`last_delivered` timestamp NULL DEFAULT NULL,
`delivered_count` smallint(6) DEFAULT NULL,
`open_count` smallint(6) DEFAULT NULL,
`click_count` smallint(6) DEFAULT NULL,
SORT KEY `__UNORDERED` (),
SHARD KEY (id),
PRIMARY key (`id`)
) ;
------------------
-- Subscriptions_insert_check Procedure
DELIMITER //
CREATE OR REPLACE PROCEDURE subscriptions_insert_check
(batch QUERY(
`id` bigint(10) unsigned NULL,
`created_at` timestamp NULL,
`updated_at` timestamp NULL,
`subscriber_id` bigint(10) NULL,
`catalog_id` bigint(10) NULL,
`upload_id` int(10) NULL,
`group_id` int(10) NULL,
`tags` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL,
`last_delivered` timestamp NULL,
`delivered_count` smallint(6) NULL,
`open_count` smallint(6) NULL,
`click_count` smallint(6) NULL)
)
AS
BEGIN
INSERT INTO subscriptions(id, created_at, updated_at, subscriber_id, catalog_id, upload_id, group_id, tags, last_delivered, delivered_count, open_count, click_count)
SELECT b.id, b.created_at, b.updated_at, b.subscriber_id, b.catalog_id, b.upload_id, b.group_id, b.tags, b.last_delivered, b.delivered_count, b.open_count, b.click_count
FROM batch b
-- The subscription must reference an existing catalog item...
WHERE EXISTS (SELECT 1 FROM catalogs c WHERE b.catalog_id = c.id)
-- ...and an existing subscriber...
AND EXISTS (SELECT 1 FROM subscribers s WHERE b.subscriber_id = s.id)
-- ...and must be unique per (subscriber, catalog) pair
AND NOT EXISTS (SELECT 1 FROM subscriptions p WHERE b.subscriber_id = p.subscriber_id AND b.catalog_id = p.catalog_id);
END //
DELIMITER ;
------------------
-- Subscriptions Pipeline
CREATE or replace PIPELINE `subscriptions`
AS LOAD DATA S3 's3://<Redacted>/*'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "<REDACTED>", "aws_secret_access_key": "<REDACTED>", "aws_session_token": "<REDACTED>"}'
BATCH_INTERVAL 2500
DISABLE OFFSETS METADATA GC
INTO PROCEDURE `subscriptions_insert_check`
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\r\n' STARTING BY ''
IGNORE 1 LINES
FORMAT CSV;

We can verify no records violating the filter were inserted with:

SELECT COUNT(*) AS invalid_rows_count
FROM subscriptions s
WHERE NOT EXISTS (SELECT 1 FROM catalogs c WHERE s.catalog_id = c.id)
OR NOT EXISTS (SELECT 1 FROM subscribers sub WHERE s.subscriber_id = sub.id)
OR EXISTS (SELECT 1 FROM subscriptions p WHERE s.id != p.id AND s.catalog_id = p.catalog_id AND s.subscriber_id = p.subscriber_id);

With this, our imagined subscription management use case has been set up to stream inserts into catalogs, subscribers and subscriptions tables respectively, while first performing the required referential integrity checks. 

Trust and verify with subscriptions

Let’s now dive deeper into subscriptions to see batches that should and should not insert, following our initial load of 2,000 subscription records. Figure 3 represents the attempted insert of the four rows below the line, with the existing values shown above it.

Row 2001 (subscriber_id 5, catalog_id 2, …) will insert because it has an existing subscriber and a catalog item that is distinct across all other subscriptions for subscriber_id = 5. Row 2002 (5, 27, …) does not insert because subscriber_id = 5 already contains an entry where catalog_id = 27. Row 2003 (3000, 27, …) will not insert because no id in the subscribers table equals 3000. Finally, row 2004 (5, 77, …) will not insert because no id in the catalogs table equals 77.

The logic of the insert checks has already been validated, but let’s purposely add these four lines (three of them bad) in full as a CSV file to our existing S3 bucket, to see if the pipeline procedure rejects the invalid inserts as expected.

After adding the new batch, called ‘BadLines_refint_subscriptions.csv’, the running pipeline automatically picks up the new file on the next batch interval (every 2.5 seconds here, per the BATCH_INTERVAL of 2500 ms).
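
One way to watch the batch land is through information_schema (a sketch; available columns vary by version):

-- Most recent batches for the subscriptions pipeline
SELECT *
FROM information_schema.PIPELINES_BATCHES_SUMMARY
WHERE pipeline_name = 'subscriptions'
ORDER BY batch_id DESC
LIMIT 5;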

We see four rows parsed; however, when we select from the table, we can see only one row was successfully inserted:
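
A quick check against the ids from Figure 3 (assuming ids 2001-2004 as above) confirms only the valid row landed:

SELECT id, subscriber_id, catalog_id
FROM subscriptions
WHERE id IN (2001, 2002, 2003, 2004);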

SingleStore does not support foreign keys in DDL but, as demonstrated in this blog, filters in INSERT … SELECT statements are a scalable way to maintain referential integrity and uniqueness checks in inserts, stored procedures and SingleStore pipelines.

Taking the next step

SingleStore is a powerful database designed specifically to support operational queries, CRUD, AI and real-time streaming analytics across a scale-out distributed architecture with sub-second latency, even at high concurrency. While foreign keys are not currently supported, SingleStore can quickly perform these checks with logic in the insert, either in the transaction itself or in a pipeline and procedure.

SingleStore’s best-in-class approach is driven by its patented Universal Storage, a memory-optimized multimodal columnstore that optimizes for speed and concurrency across a clustered system.

Still skeptical? Don’t take our word for it; test it yourself. Spin up a free trial and follow along with this blog, or explore some of our other use-case specific walkthroughs here. At whatever size or scale you’re looking to test, we can help prove the value of multimodal HTAP for your transactional or analytical use case.

