Table Maintenance after Bulk Modifications

Databases are almost perfect at choosing the ideal index to execute a query. But after significant bulk modifications, you may have to run a maintenance operation to preserve that ability.

A database depends on its table statistics to choose the correct index: Based on the stored distribution of values, it estimates how many rows match a condition and whether an index lookup or a table scan is more efficient. But the statistics may become stale if you change a significant number of rows with bulk operations (e.g. importing contacts or deleting many rows). The skewed information may lead to no index being used in production until the database refreshes the statistics. You should manually trigger a statistics update to prevent that problem.

Usage

MySQL

ANALYZE TABLE contacts

PostgreSQL

ANALYZE contacts

Detailed Explanation

A query's performance depends on the database choosing the correct index. While that decision is usually accurate, it is sometimes off. The most common symptom is that no index is used at all even though you know a perfectly matching one exists.
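
You can verify which plan the database picked with EXPLAIN, which works in both MySQL and PostgreSQL. The contacts table and email column below are only placeholders for your own schema.

-- Show the chosen execution plan without running the query.
EXPLAIN SELECT * FROM contacts WHERE email = 'jane@example.com';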

Any modern database works with a cost-based approach: It calculates different ways to execute a query and chooses the one with the lowest cost. However, estimating that cost is complex prediction logic that relies on heuristic information. The most important input is the table statistics.

The table statistics are an approximation of the distribution of values in a table. The query planner uses them to estimate the number of rows a condition may match. From there, it is a simple calculation whether using an index to find and load the matching rows with many random I/O operations is faster than scanning the whole table with sequential I/O operations and only keeping the matching rows.
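
In PostgreSQL you can inspect these stored statistics directly in the pg_stats view; the contacts table name is again just a placeholder. MySQL keeps comparable information in tables such as mysql.innodb_table_stats.

-- PostgreSQL: distribution data the planner uses for row estimates.
SELECT attname, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'contacts';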

The database tries to keep the distribution of values up to date whenever you modify anything within a table. This process works well, as your queries are usually executed with the correct indexes. When the database recognizes that the cumulative changes have crossed a threshold at which the statistics may no longer be accurate, it automatically recreates them from scratch so the query planner can work with the most accurate values.
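
In PostgreSQL this automatic refresh is performed by the autovacuum daemon. You can check when it last ran for a table and what the global thresholds are; the contacts name is a placeholder.

-- PostgreSQL: when were the statistics last refreshed?
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'contacts';

-- The settings that control automatic refreshes.
SHOW autovacuum_analyze_threshold;
SHOW autovacuum_analyze_scale_factor;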

However, the statistics may become stale when you do bulk modifications on a table. Your changes may not be significant enough to trigger a complete recalculation of the statistics if you e.g. import many contacts from a spreadsheet or delete old data. For larger tables, the threshold of needed changes is typically higher than what a single bulk modification reaches. But the distribution may still be skewed enough that the query planner no longer chooses the correct index.
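
In PostgreSQL you can see how many modifications have accumulated since the last refresh and compare that to the trigger point, which is roughly autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * number of rows. The contacts name is again a placeholder.

-- PostgreSQL: rows changed since the statistics were last recalculated.
SELECT relname, n_live_tup, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'contacts';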

You don't want your production database to start using no index or an incorrect one after bulk modifications. A good countermeasure is to manually trigger a complete recalculation of the table statistics after every significant bulk modification. You only have to execute the ANALYZE statement; the statistics are recalculated in parallel to all other queries without blocking anything.
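
A typical pattern is to run ANALYZE as the last step of the import or cleanup job itself. The contacts table, the created_at column and the date cutoff are placeholders.

-- Delete old data in bulk, then refresh the statistics immediately.
DELETE FROM contacts WHERE created_at < '2020-01-01';
ANALYZE contacts;          -- PostgreSQL
-- ANALYZE TABLE contacts; -- MySQL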


Author

Tobias Petry

SQL for Devs Founder

