Sorted Tables for Faster Range-Scans

Rows can be stored in a table sorted by their content instead of by the time of their insertion. Some queries become significantly faster this way once the data size exceeds the server's memory.

By default, most developers use an automatically incrementing column as the primary key, which stores the table's data in insertion order. However, rows that belong together end up very far apart in the database file, resulting in many disk fetch operations. When product comments, for example, are instead stored sorted by product and comment order, many queries become more efficient: Selecting the most recent comments for a product fetches just a few adjacent rows instead of rows scattered across the file.

Usage

MySQL

CREATE TABLE product_comments (
  product_id bigint,
  comment_id bigint auto_increment UNIQUE KEY,
  message text,
  -- Composite primary key: InnoDB clusters rows by it, so all comments of a
  -- product are stored adjacent to each other, ordered by comment_id.
  PRIMARY KEY (product_id, comment_id)
);

SELECT *
FROM product_comments
WHERE product_id = 2
ORDER BY comment_id ASC
LIMIT 10;

PostgreSQL

CREATE TABLE product_comments (
  product_id bigint,
  comment_id bigint GENERATED ALWAYS AS IDENTITY,
  message text,
  PRIMARY KEY (product_id, comment_id)
);

-- Rewrite the table sorted by the primary key index; has to be re-run
-- manually, as new rows are appended unsorted.
CLUSTER product_comments USING product_comments_pkey;

SELECT *
FROM product_comments
WHERE product_id = 2
ORDER BY comment_id ASC
LIMIT 10;

Detailed Explanation

Auto-incrementing primary keys

Most tables are created with surrogate primary keys: Instead of deriving the primary key from the actual data, a column such as an automatically incremented counter is used to identify a row. Unlike with natural primary keys, changes to the data never require changing the primary key and all references to the table, which significantly reduces implementation effort. For this reason, it is the most widely used approach for choosing a primary key.
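For comparison, a minimal sketch of this conventional layout (illustrative only, not part of the usage examples above): rows are clustered by the surrogate id, and a secondary index is needed to find a product's comments.

CREATE TABLE product_comments (
  id bigint auto_increment PRIMARY KEY,
  product_id bigint,
  message text
);

-- The index finds a product's comment ids quickly, but every matched row
-- still has to be fetched from wherever it was inserted in the table file.
CREATE INDEX product_comments_product_id ON product_comments (product_id, id);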

When data size exceeds the server's memory

However, when the data size exceeds the memory size, this approach can become a performance penalty. A simple query like getting the ten most recent comments for a product, as seen in the example, has particular performance implications: The ten most recent records are found by an index but still need to be looked up in the table by the database. When all data fits into memory, the database can read these rows from the in-memory cached database file and the performance will be stellar. If not all data fits into memory, the situation is different: As rows are saved in the database file in insertion order, ten disk fetches from different locations are needed. The performance is much worse because many random I/O operations on the disk are required.

Optimized storage for fewer disk seeks

When data is stored in a way optimized for the most important query patterns, performance stays good even for data exceeding the server's memory size. As the number of data fetches constrains the performance, reducing them increases the performance. By storing the rows in the table file already sorted by product and comment age, the database does not have to execute many data fetches to different locations. The work is reduced from multiple fetches at different offsets to a single disk seek to one offset followed by reading several adjacent records. For HDDs as well as SSDs, sequential I/O is always much faster than random I/O to arbitrary offsets. The way data is stored can have a big impact on performance!

This optimization is not limited to selecting a range of records. It also applies to statistical queries (aggregations) over a subset of the data: When calculating statistics about a specific product, user, or any other resource, having the needed records adjacent to each other reduces disk seeking and fetching time.
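For example, an aggregation like the following (an illustrative query over the sorted table from the usage examples) reads one contiguous range of rows instead of collecting them from all over the file:

-- Statistics for a single product: with the composite primary key, all
-- matching rows are adjacent and can be read mostly sequentially.
SELECT count(*) AS comment_count,
       avg(length(message)) AS avg_message_length
FROM product_comments
WHERE product_id = 2;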

MySQL implementation

It is easy to get this performance improvement in MySQL, as the table file is automatically kept sorted by the primary key. The primary key just needs to be defined in a way that sorts the data best for the application's requirements. However, as the data is kept in sorted order, write operations now place new rows at random offsets within the database file instead of appending them to the end. Consequently, inserting any new row will be slower. For read-mostly datasets with rarely inserted new records, this performance impact is not a big deal.

The slow insert performance can be fixed with a different approach for inserting new data:

  • A new unsorted table for recent records is created.
  • Any new record is saved to this new table instead of the standard sorted table, keeping inserts fast.
  • Every few minutes, all records are moved to the standard table in bulk.
  • Any query reading data naturally has to read from both tables, as the recent table can contain data that has not been moved yet. To simplify this, a view can union the data of both tables and be used for effortless querying.

Instead of repeatedly writing single rows to, e.g., thousands of random locations, the inserts are now bulked together into fewer, more efficient write operations to the sorted table, as the sketch below shows.
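A minimal MySQL sketch of this staging-table approach, based on the table from the usage example. For simplicity, the staging table reuses the same structure (it stays small enough to live in memory, so inserts remain fast), and the periodic move would typically be triggered by a cron job or scheduled event; the table and view names are illustrative.

-- Small staging table for recent rows; all inserts go through this table,
-- so its auto-increment counter keeps generating the comment ids.
CREATE TABLE product_comments_recent LIKE product_comments;

-- Queries read from a view combining both tables.
CREATE VIEW product_comments_all AS
  SELECT * FROM product_comments
  UNION ALL
  SELECT * FROM product_comments_recent;

-- Every few minutes, move the staged rows in bulk. Only rows seen at the
-- start are moved, so concurrent inserts are not lost.
START TRANSACTION;
SET @moved_up_to = (SELECT MAX(comment_id) FROM product_comments_recent);
INSERT INTO product_comments
  SELECT * FROM product_comments_recent WHERE comment_id <= @moved_up_to;
DELETE FROM product_comments_recent WHERE comment_id <= @moved_up_to;
COMMIT;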

PostgreSQL implementation

For PostgreSQL, the approach is different, as no automatic sorting is available. Compared to MySQL, the insert performance will not suffer, as rows are only appended to the end of the file. The CLUSTER command sorts the rows in the table each time it is executed. But the command locks the whole table, preventing any read or write operation, so the application will be unresponsive for the duration of the sorting procedure.

However, the PostgreSQL community built the pg_repack extension, which can sort the data without locking the table: It creates a copy of the table with all data sorted and swaps the two when the copy process is finished. While the data is being copied, it is stored on disk twice for some time, which is problematic for large datasets. To use less space, the data can be partitioned into smaller tables, so the temporarily increased storage is no longer a big problem. If, for example, monthly partitions are used, past partitions do not have to be sorted again, as no new data will be saved to them and they are already sorted.
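A possible way to apply pg_repack here, sketched under the assumption that the extension is installed on the server; the database name is a placeholder.

-- Enable the extension once per database.
CREATE EXTENSION pg_repack;

-- Then, from the shell, rewrite the table ordered by the primary key columns
-- without blocking reads or writes (copy and swap in the background):
--   pg_repack --table=product_comments --order-by="product_id, comment_id" shop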

Impact on ORMs

Many libraries do not support primary keys consisting of multiple columns. Likewise, many frameworks do not support multiple columns in a URL to reference a single row in the database. In these cases, the software can most often be tricked by treating `comment_id` as the primary key, as it is still a unique value for all records. Alternatively, a UUID column can be used, which also prevents enumeration attacks.
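A possible PostgreSQL sketch of the UUID alternative; the column name is an assumption, and gen_random_uuid() is built in since PostgreSQL 13 (older versions need the pgcrypto extension).

-- Unique, non-guessable identifier for URLs and ORM lookups,
-- independent of the composite primary key that controls storage order.
ALTER TABLE product_comments
  ADD COLUMN public_id uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE;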
