Scaling Worldwide Parcel Logistics with SingleStore and Vectorized

Carl Sverre

Senior Director, Engineering


Today, let's talk about how SingleStore and Redpanda can work together to solve the operational complexity of global logistics while handling 100x the number of packages delivered annually. SingleStore is a scale-out relational database built for data-intensive workloads. Redpanda is a Kafka API-compatible streaming platform for mission-critical workloads, created by the team at Vectorized. In this blog post, we present a reference architecture that uses these two systems to scale worldwide parcel shipping to never-before-seen volumes.

Conveniently, logistics simulation has always been on my bucket list of things to build, so when the opportunity arose to create something with Redpanda and SingleStore it was an easy choice to make. Before we get into the reference architecture, let's talk about some of the challenges logistics companies face every day.

On October 12th, 2020, the Pitney Bowes Parcel Shipping Index reported that 103 billion packages were delivered in 2019 alone. That's an average of over 282 million packages per day, or roughly 3,200 packages per second, and we aren't slowing down! The same report forecasts those numbers will increase 2-3x by 2026. The numbers get even more extreme when you consider peak rather than average volume. For example, Amazon announced that over 250 million items were purchased during the 48 hours of Prime Day 2021. That's a peak rate of up to 5 million packages shipped per hour (roughly 1,400 packages per second), which on its own is nearly half of the global average rate. It's clear from these statistics that we need logistics systems which can scale to handle unprecedented peak activity in order to keep up with the growing demand for e-commerce worldwide.
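As a quick back-of-envelope check on those per-second figures:

$$
\frac{103 \times 10^{9} \ \text{packages/year}}{365 \times 86{,}400 \ \text{s/year}} \approx 3{,}266 \ \text{packages/s}
\qquad
\frac{250 \times 10^{6} \ \text{items}}{48 \times 3{,}600 \ \text{s}} \approx 1{,}447 \ \text{items/s}
$$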

Scale isn't the only challenge in architecting a system for global logistics; complexity is another. As an example, let's consider one possible package life cycle.

Each of these parcel lifecycle steps involves a lot of hidden complexity. Consider moving packages between departure and arrival scans: during this transportation phase, a package needs to be routed, packed, shipped, unpacked, and handled countless times. At any point, the system needs to be able to dynamically surface reasonable ETAs and location history to customers while ensuring that no data is lost and the package eventually arrives at its destination.

Due to the complex system demands of global logistics, within each company you will find a sprawling infrastructure composed of innumerable databases, event streams, data lakes, processing clusters, monitoring systems, and more. Collectively, these systems process billions of transactions per day with the combined goal of ensuring that each package has a safe and on-time journey. Within this architecture you will find the crown jewel: the package tracking system. While I don't work for a logistics company, based on my research and experience working with large-scale data systems, most of the package tracking systems out there use an event-driven and service-oriented architecture to ensure that they can scale to meet the required demand. In general, an event-driven architecture focuses on keeping track of events as they happen, while a service-oriented architecture focuses on manipulating and querying data at rest. It's common to see large organizations use both of these paradigms at the same time.

Setting out to build a simulation of global package logistics, I decided to focus on long-haul delivery: moving packages long distances rather than worrying about the last-mile handoff. Here is a diagram outlining the architecture I ended up with:

Within my architecture, packages are simulated by a scale-out simulation cluster. Each simulator thread generates thousands of packages and emits events into Redpanda topics as those packages are shipped from origin to destination. In order to create semi-believable package histories, the simulation code includes a rudimentary routing algorithm as well as a virtual clock. Let's look at an example package history using SingleStore Studio:
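Under the hood, a history like the one in that screenshot boils down to a simple query over the package_transitions table described later in this post. Here is a minimal sketch (the packageid is a placeholder, not a real record):

-- Fetch one package's full journey in order (placeholder packageid).
SELECT seq, kind, locationid, next_locationid, recorded
FROM package_transitions
WHERE packageid = '00000000-0000-0000-0000-000000000000'
ORDER BY seq;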

Once the events arrive in Redpanda topics, SingleStore loads them immediately using Pipelines, in batches of up to 100 million records. Because SingleStore commits the offset metadata along with the data itself in a single distributed transaction, each record is loaded exactly once. Combined with Redpanda's built-in durability, this drastically reduces the complexity of ensuring that no data is lost along the way.
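To give a feel for what this looks like, here is a minimal Pipeline sketch. It assumes JSON-encoded transition events on a topic named transitions, with a broker reachable at redpanda:9092; the project's real pipeline definition lives in the repository linked at the end of this post, so treat these names and the encoding as placeholders.

-- Hypothetical pipeline: stream the transitions topic from Redpanda into the
-- process_transitions() stored procedure shown below. Broker address, topic
-- name, and JSON field names are assumptions for illustration only.
CREATE OR REPLACE PIPELINE transitions
AS LOAD DATA KAFKA 'redpanda:9092/transitions'
INTO PROCEDURE process_transitions
FORMAT JSON (
    packageid <- packageid,
    seq <- seq,
    locationid <- locationid,
    next_locationid <- next_locationid,
    recorded <- recorded,
    kind <- kind
);

START PIPELINE transitions;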

It's worth highlighting the data model used to efficiently process and store the data in SingleStore. We use two columnstore tables to store the raw package and transition records. These tables provide on-disk compression without sacrificing query performance, which is perfect for these mostly historical datasets. To handle packages in transit, however, we need a storage engine optimized for a high volume of inserts, updates, and deletes, which is why we use a rowstore table to store the state of every undelivered package. Finally, we use a reference table for the mostly static locations dataset, which contains information about each hub and point location in our global logistics network. A reference table is replicated to every node in the cluster so that all queries can access location data without broadcasting it during query execution. Here is a visualization of the data model to clarify the relationships between our tables and the incoming data Pipelines.
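To make that concrete, here is a simplified DDL sketch of the tables involved. The columns mirror the fields used by the stored procedure below, but the keys, the omitted packages table, and the locations columns are assumptions; the project's actual schema lives in the repository linked at the end of this post.

-- Full transition history (columnstore; the companion packages table is omitted).
-- Keys shown here are illustrative assumptions.
CREATE TABLE package_transitions (
    packageid CHAR(36) NOT NULL,
    seq INT NOT NULL,
    locationid BIGINT NOT NULL,
    next_locationid BIGINT,
    recorded DATETIME NOT NULL,
    kind TEXT NOT NULL,
    SORT KEY (recorded),
    SHARD KEY (packageid),
    UNIQUE KEY (packageid, seq) USING HASH
);

-- Latest state of each undelivered package (rowstore for fast upserts and deletes).
CREATE ROWSTORE TABLE package_states (
    packageid CHAR(36) NOT NULL,
    seq INT NOT NULL,
    locationid BIGINT NOT NULL,
    next_locationid BIGINT,
    recorded DATETIME NOT NULL,
    kind TEXT NOT NULL,
    PRIMARY KEY (packageid)
);

-- Mostly static location metadata, replicated to every node in the cluster.
CREATE REFERENCE TABLE locations (
    locationid BIGINT NOT NULL,
    name TEXT NOT NULL,
    latitude DOUBLE NOT NULL,
    longitude DOUBLE NOT NULL,
    PRIMARY KEY (locationid)
);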

One interesting feature of our schema is the process_transitions() stored procedure. This procedure is responsible for keeping the package_transitions and package_states tables in sync with the latest transitions loaded from Redpanda topics. Because no engineering blog post is finished without a code sample, here is the procedure in its entirety:

CREATE OR REPLACE PROCEDURE process_transitions(batch QUERY(
    packageid CHAR(36) NOT NULL,
    seq INT NOT NULL,
    locationid BIGINT NOT NULL,
    next_locationid BIGINT,
    recorded DATETIME NOT NULL,
    kind TEXT NOT NULL
))
AS
BEGIN
    -- 1) Append the incoming batch to the full per-package history.
    REPLACE INTO package_transitions (packageid, seq, locationid, next_locationid, recorded, kind)
    SELECT * FROM batch;

    -- 2) Upsert the latest known state for each package, keeping the row with
    --    the highest sequence number if records arrive out of order.
    INSERT INTO package_states (packageid, seq, locationid, next_locationid, recorded, kind)
    SELECT
        packageid,
        seq,
        locationid,
        next_locationid,
        recorded,
        statekind AS kind
    FROM (
        SELECT *, CASE
            WHEN kind = "arrival_scan" THEN "at_rest"
            WHEN kind = "departure_scan" THEN "in_flight"
        END AS statekind
        FROM batch
    ) batch
    WHERE batch.kind != "delivered"
    ON DUPLICATE KEY UPDATE
        seq = IF(VALUES(seq) > package_states.seq, VALUES(seq), package_states.seq),
        locationid = IF(VALUES(seq) > package_states.seq, VALUES(locationid), package_states.locationid),
        next_locationid = IF(VALUES(seq) > package_states.seq, VALUES(next_locationid), package_states.next_locationid),
        recorded = IF(VALUES(seq) > package_states.seq, VALUES(recorded), package_states.recorded),
        kind = IF(VALUES(seq) > package_states.seq, VALUES(kind), package_states.kind);

    -- 3) Remove delivered packages so package_states only tracks parcels in transit.
    DELETE package_states
    FROM package_states JOIN batch
    WHERE
        package_states.packageid = batch.packageid
        AND batch.kind = "delivered";
END

As you can see from the code, the procedure performs three operations. The first copies the incoming batch into the package_transitions table, giving us the full history of every package. The second upserts the batch into the package_states table, taking the sequence number into account; merging incoming records by sequence number keeps the state correct even if we process transitions slightly out of order within a single batch. The third deletes any delivered packages from package_states, so the table only tracks packages currently in transit, which keeps our memory usage stable over time. The stored procedure is executed transactionally by SingleStore Pipelines, in real time, as batches of data become available in Redpanda.

With our simulator cluster humming along and our data model in place, we come to the last piece of the puzzle: data visualization. For this demo I used Grafana and created two main dashboards. The first dashboard showcases the raw performance of the simulation. We focus on three main statistics: package transitions per second, Redpanda's ingress/egress, and the ingest rate into SingleStore. In the screenshot below, you can see that with a single simulator running on my desktop machine, I am able to deliver up to 20k packages per second (roughly 6x the average global package delivery rate in 2019).

The second dashboard showcases the actual data. Since the simulator runs much faster than real time (see the simulator clock in the top right), the per-second rate of deliveries on this dashboard is much lower. Think of it as spreading out roughly twenty thousand deliveries per second over 100 virtual hours in order to ensure that the data makes some amount of sense (i.e. packages don't fly around the world in 3 seconds).

In addition to surfacing some metrics about the actual data, this dashboard also exposes two maps showcasing real-time inventory and current package destinations. These maps are computed on the fly from SingleStore and can be refreshed in real time as packages travel from origin to destination.
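As an example of how a map panel like that can be driven, here is one possible query for the inventory map, counting packages currently at rest at each location (the name, latitude, and longitude columns are the same assumptions as in the schema sketch above; the dashboards in the repository may compute this differently):

-- Hypothetical inventory-map query: how many packages are sitting at each location.
SELECT l.locationid, l.name, l.latitude, l.longitude, COUNT(*) AS packages_at_rest
FROM package_states s
JOIN locations l ON l.locationid = s.locationid
WHERE s.kind = 'at_rest'
GROUP BY l.locationid, l.name, l.latitude, l.longitude;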

Now that we have gone over how the simulation works, let's revisit the claim that we can handle 100x the volume of global package delivery per year. At the 2019 average of roughly 3,200 packages per second, that means sustaining at least 320k deliveries per second. To accomplish this, I deployed the reference architecture to Google Cloud, resulting in this beautiful image of raw computational performance:

As you can see from this screenshot, the simulation is processing data at roughly 1.5 GB/s, ultimately delivering almost 400k packages per second. That's roughly 125x the global package delivery volume in 2019, easily exceeding the goal I originally set for myself. The cluster running this simulation included 4 Redpanda brokers (16 vCPUs & 64 GB RAM each) and 16 SingleStore leaf nodes (32 vCPUs & 128 GB RAM each).

In closing, I want to thank the team at Vectorized for helping me squeeze every ounce of performance out of this reference architecture, and the entire engineering team at SingleStore for building such a capable database. Peace!

Any technical blog post is only as good as its code, so I encourage you to check it out here: https://github.com/singlestore-labs/singlestore-logistics-sim. You can spin it up on your own machine using Docker, or deploy it directly to Google Cloud using Terraform, by following the instructions in the readme.

The whiteboard images in this post were all created in Excalidraw - one of my favorite companies ever!

To try SingleStore for yourself, head over to the SingleStore website.

To try Redpanda, you can download it from the Vectorized website.

