Fill Tables With Large Amounts Of Test Data

The more rows a table has, the more reliable performance tests for a query become. But instead of writing a run-once script, you can fill a table with millions of rows using a single SQL query.

For performance tests, your tables have to be much bigger than they typically are in your development or staging environments. Writing a run-once script to fill a table is the most common approach to this problem, but it either takes very long because rows are inserted one after another, or the development gets more complex because you have to implement chunked bulk inserts. Instead, you can write a single SQL query that inserts an arbitrary number of rows all at once and let the database make it fast.

Usage

MySQL

SET cte_max_recursion_depth = 4294967295;
INSERT INTO contacts (firstname, lastname)
WITH RECURSIVE counter(n) AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM counter WHERE n < 100000
)
SELECT CONCAT('firstname-', counter.n), CONCAT('lastname-', counter.n)
FROM counter;

PostgreSQL

INSERT INTO contacts (firstname, lastname)
SELECT CONCAT('firstname-', i), CONCAT('lastname-', i)
FROM generate_series(1, 100000) AS i;

Detailed Explanation

Quite often, you have to make your tables much bigger with test data than they currently are. You want to test whether your application or an index still works the same when you have at least 10x or 100x the data. With more data, an index scan could take longer, less data cached in memory could increase fetching times, and many other effects could reduce performance. So it's better to fill up that table and test early.

The traditional approach is to write a small script in your favorite programming language that is used only once. With a fake-data generation library, you can create thousands of rows within milliseconds. However, this approach will be slow because the rows are inserted individually. You could restructure the code to do bulk inserts in large chunks, but that takes more effort to develop, so it is usually skipped. You only need it once, so why spend more time on it?
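
For illustration only, this is roughly what the output of such a chunked script would look like: instead of one INSERT per row, many generated rows are batched into a single multi-row INSERT for the contacts table used in the examples below. The concrete values are hypothetical.

-- Hypothetical output of a chunked bulk-insert script:
-- many generated rows are batched into one statement.
INSERT INTO contacts (firstname, lastname) VALUES
    ('firstname-1', 'lastname-1'),
    ('firstname-2', 'lastname-2'),
    ('firstname-3', 'lastname-3');
-- ...repeated for every further chunk of rows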

But there is a much more straightforward way: You can do everything with a single SQL query and let the database optimize this massive bulk import. The idea is to make the database create and iterate a counter up to a specified maximum value. For each counter value, you can insert static values or concatenate them with the counter to make them unique. With PostgreSQL, you can use the generate_series function, which generates a virtual table filled with all values between your start and end point. With other databases, you have to emulate that functionality: with a recursive common table expression (CTE), you write a counter as a starting point and recursively add another row to the result set until the desired size is reached.
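
If a plain counter is too uniform for your test, you can mix random values into the generated rows. The following PostgreSQL sketch is an assumption building on the usage example above; md5(random()::text) simply produces a pseudo-random string for each generated row.

INSERT INTO contacts (firstname, lastname)
SELECT
    CONCAT('firstname-', i),  -- unique value derived from the counter
    md5(random()::text)       -- pseudo-random string per row
FROM generate_series(1, 100000) AS i;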

DISCLAIMER: Never forget that generated test data behaves differently than actual production data. Whether you create the data with this approach or with some data-faking library makes no difference. All data has patterns you may or may not know about. Following the Pareto principle, your biggest customers account for roughly 80% of your total storage size, and their data is distributed differently than anything you would generate manually: some values are barely used at all, while others form hotspots that are hit all over the place. This difference in data distribution may lead to indexes being used in production that won't be used with test data, and vice versa.
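
If you want the generated data to at least approximate such hotspots, you can skew the generated values instead of spreading them evenly. A minimal sketch, assuming PostgreSQL and the contacts table from the usage example: power(random(), 3) is heavily biased towards 0, so a few lastname values occur far more often than the rest.

INSERT INTO contacts (firstname, lastname)
SELECT
    CONCAT('firstname-', i),
    -- floor(power(random(), 3) * 50) clusters near 0, creating hotspot values
    CONCAT('lastname-', floor(power(random(), 3) * 50)::int)
FROM generate_series(1, 100000) AS i;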

Author: Tobias Petry, SQL for Devs Founder

