Understanding PostgreSQL’s Data Fragmentation Problem, and How SingleStoreDB Is Better


Ankit Goyal

Pre-Sales Lead, MENA & GSI's

Understanding PostgreSQL’s Data Fragmentation Problem, and How SingleStoreDB Is Better

PostgreSQL is a relational database management system with a client-server architecture.

At the server- side, PostgreSQL’s processes and shared memory work together and build an instance, which handles data access. Client programs connect to the instance, and request read and write operations.

Let's talk  about PostgreSQL first, and look at the fragmentation problem it has:

what-is-fragmentationWhat Is Fragmentation?

Fragmentation is a database server feature that allows you to control where data is stored at the table level. Fragmentation enables you to define groups of rows or index keys within a table according to some algorithm or scheme.

Fragmentation is often called bloating in PostgreSQL. It relates to its implementation of MVCC (Multi-version Concurrency Control) where rows are not updated in place or directly deleted, but rather copied with a different ID. Those rows are then made visible or invisible depending on the transaction looking at the data. Basically, any update on the table is a Delete – Insert where the existing row is first deleted and a new row is inserted.

mvcc-multi-version-concurrency-controlMVCC (Multi-version Concurrency Control)

Multi-version concurrency control (MVCC) is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory. Without concurrency control, if someone is reading from a database at the same time someone else is writing to it, it is possible that the reader will see a half-written or inconsistent piece of data.

Postgres maintains data consistency by using a multi-version model. This means that while querying a database, each transaction sees a snapshot of data (a database version) as it previously existed — regardless of the current state of the underlying data. This protects the transaction from viewing inconsistent data that could be caused by (other) concurrent transaction updates on the same data rows, providing transaction isolation for each database session.

fragmentation-example-bloatingFragmentation Example (Bloating)

Let’s take a simple example to understand what bloating is, how it occurs and how to get rid of it.

Step 1: Create a simple table with one column. I have created a table called datafreg which has an id column, and is also a primary key column as shown here. You can see the specifics of the table with the \d <table_name> command.

Step 2: Generate some sample data in the table. I have used the generate_series function and inserted 4,000 rows. You can confirm the data inserted by using the count (*) on the table as shown here:

Step 3: Query pg_stat_all_tables to check the details on live and dead tuples. Any rows inserted in the table will be shown as live tuples — since it’s live data under a table and any rows you delete will be shown under dead tuples. As we have inserted 4,000 records, 4,000 live tuples are available for the table (shown here). 

Step 4: Also, check the size of the table along with the count. Here, I have used the pg_size_pretty function to check the table size.  There are a lot of ways to do this, and you can use any of them. As shown, the size of the table is 416 KB.

Step 5: Delete some table rows. Here, I have deleted 2,000 rows using the between clause. After deleting the rows, check the tuples again — since you have 2,000 rows in the table, it will show up under live tuple. The 2,000 rows you deleted are shown under dead tuple.

Step 6: After deleting the rows, check the size and count of the table again. As shown here, you can see the count has reduced to 2,000 but the size of the table remains the same. It was 416 KB (Step 4) initially and even after deleting 2,000 rows, the table size remains the same.

This is called bloating. Even after deleting the data, the space used by the deleted data is not released from the table. It will remain in the table, and the database will think that data is using the space. In reality, it’s just an impression and no data exists since we already took the steps to delete it. 

To free up the unutilized space and defragment the table, you will need to periodically perform VACUUM on the tables. Full VACUUM on the table puts the entire table under lock, and it is as good as downtime. So any tables with a high volume of DML’s happening on it should be vacuumed periodically, ensuring performance degradation does not happen and the unused space is released to the file system.

Hence, PostgreSQL is not ideal for an OLTP system which has high transaction volume, as over a period of time your database performance will start to decline.

Next, let's take a look at SingleStoreDB and how it handles fragmentation.

what-is-single-store-dbWhat Is SingleStoreDB?

SingleStore is a real-time,  distributed SQL database that handles both analytical (OLAP) and transactional (OLTP) workloads in the same table type. SingleStoreDB provides fast ingestion and high query speed for complex OLTP and OLAP queries. It provides a robust, scalable solution that is levels above what other legacy single node databases can do. In addition to Singlestore Helios, there is also a managed service that can be deployed on AWS, GCP or Microsoft Azure.

optimize-table-in-single-store-dbOPTIMIZE TABLE in SingleStoreDB

SingleStoreDB automatically runs optimization routines on tables, but they can also be started manually. Table memory can be freed when the optimize table command is run.

SingleStoreDB supports online optimize table, which means you can read and write while the table is being optimized. Optimize table` on a sharded table is always executed online.

automatic-background-optimizationAutomatic Background Optimization

 SingleStoreDB has background threads that automatically optimize columnstore indexes as needed. These optimizations are run in a single thread to minimize impact to concurrent query workloads. The background optimization will attempt to keep the number of row segment groups low, but will not attempt to create a single row segment group due to the cost of this operation.

Let's look at an example of how easy this is to execute in SingleStoreDB.

As shown in the following image, I have a table called events with 863 Million records:

If I run the command in real time to get the number of records, you can see the query comes out in only 0.2 seconds:

Next, let’s run the optimize table command online when other transactions in the tables are running:

You can see the table was optimized under 0.3 seconds without any kind of downtime needed or any locking established on the table. Now, the most interesting thing here is that I am using the smallest SingleStoreDB cluster, S-00,  to demonstrate this example.

One of the many  reasons SingleStoreDB can achieve this is because it offers ground-breaking Universal Storage and SQL-based execution built around a scalable, distributed architecture and lock-free data structures.

So what are you waiting for? Try SingleStoreDB for free, and let us know your experience!