Appendix
Every benchmark is flawed. It's impossible to create one that applies to everyone, because databases can be used in many different ways. As a result, you shouldn't trust any benchmark and always run your own tests with your own data and queries that match your specific situation. The best a generic benchmark can strive for is to be as fair and balanced as possible - which is what I did.
My goal with the course is to show you how easily you can swap out your current struggling database for TimescaleDB without having to learn a whole new way of doing things. That's what I love about TimescaleDB: Just a few queries and you're done. Pure magic. So in this benchmark, I'll compare TimescaleDB to MySQL and PostgreSQL, which are the two most widely used open-source databases. If you're interested in seeing how TimescaleDB stacks up against other analytical databases, you'll need to look elsewhere. There are plenty of ones to explore but they are all not as easy to use.
Only two good benchmarks compare TimescaleDB to MySQL/PostgreSQL. And both of them have some problems - as I believe:
I've used c6a.4xlarge instances on AWS with a 500GB gp3 EBS volume, which is the standard in the database community for benchmarking. This machine has 16 vCPUs, 32GB of memory and can handle 3,000 IOPS. To ensure a fair comparison, I made some common-sense configuration tweaks to all databases (avoiding any little-known tricks):
I was unable to find a dataset that perfectly fit my benchmark needs. Building one from scratch wasn't an option, as I would have produced unrealistic data distributions and therefore an unrealistic benchmark. After considering my options, I've chosen to use the ClickBench dataset, which provides information about website visits: This dataset is a single table covering all possible information about website visitors with nearly 100 million rows (99,997,497 to be precise) in the period from 2013-07-01 20:00:00 to 2013-07-31 19:59:59. The CSV file is quite large with 80.95 GB and 105 columns.
However, I had to make some changes to the data to make it more realistic:
A TimescaleDB hypertable can be configured in many ways, and I've tried to predict the most common approach used in real-world applications. However, this means the benchmark isn't optimized for maximum performance. Instead, it's intended to reflect real-world usage.
I've activated the TimescaleDB Columnstore (Hypercore) as you would typically do: I set the segmentation to the `Tenant` column and the sorting to `EventTime` - no specific tuning applied. I also limited the columnstore to exclude the newest day (`2013-07-31`), since it would still be being inserted into a real application and most hypertables are likely configured this way. As a result, queries that include the last day won't benefit from the columnstore's improved performance. This approach benefits MySQL and PostgreSQL.
For all queries, a TimescaleDB Continuous Aggregate (incremental materialized view) has been created to optimize performance. To make this more realistic, the continuous aggregates have been defined with a one-hour time range, even though the queries group data by days. This setup would allow reuse of the continuous aggregates for drilling down into a specific day, although it's not used in the benchmark. The trade-off is a slight loss of performance, but it's a more realistic use case. This approach benefits MySQL and PostgreSQL.
I've also added TimescaleDB Realtime Continuous Aggregates that don't pre-summarize the last hour of data. Some applications can show slightly delayed data and some must show live data. It's impossible to predict what you need. So these realtime continuous aggregates always calculate live data from the hypertable between '2013-07-31 19:00:00' and '2013-07-31 19:59:59', and then merge it with the pre-summarized data from previous hours. Also, keep in mind that the last day isn't stored in a TimescaleDB columnstore, so all rows must be fetched completely unoptimized like PostgreSQL does. This approach benefits MySQL and PostgreSQL.
I've decided to run five different queries with increasing complexity. I've spent a week and a half on this benchmark, and that's enough for now.
I focused the benchmark on hot run performance where all data is already in memory. This skips the slower initial query run, which would be very hard to benchmark accurately. This approach favors MySQL and PostgreSQL because their on-disk size is much larger than TimescaleDB's. Additionally, they cannot selectively read only the needed columns from disk, unlike TimescaleDB's columnstore.
To warm up the cache, all queries are executed five times. Then, each query was executed 11 times and the median runtime used as the result. I reran tests to ensure reliable results when at least one of the benchmark runs differed significantly from the median performance.
The following results compare PostgreSQL and MySQL to different configurations of TimescaleDB (as explained before):
As expected, MySQL and PostgreSQL have similar table sizes after importing the dataset. Interestingly, these sizes are very close to the 80.95 GB CSV file. TimescaleDB achieves a 60% reduction in size, which is a decent compression ratio, although I've seen much better results in other projects. The anonymization process in the ClickBench dataset made compression more challenging by transforming long strings such as URLs, page titles and referers to mostly random values.
SELECT COUNT(*) FROM hits WHERE Tenant = 1;
-- Continuous Aggregate Definition:
CREATE MATERIALIZED VIEW hits_hourly WITH (timescaledb.continuous) AS
SELECT TIME_BUCKET('1 hour', EventTime) AS hour, Tenant, COUNT(*) AS cnt
FROM hits
GROUP BY hour, Tenant;
-- Continuous Aggregate Query:
SELECT SUM(cnt) from hits_hourly WHERE Tenant = 1 SELECT COUNT(*) FROM hits WHERE Tenant = 1; This query is straightforward because the database only needs to scan the index for the tenant and count the entries. It provides a baseline to estimate the performance cost of handling large operations on 10 million rows.
The results confirm that MySQL is not suitable for statistical queries (even simple ones) because a single thread must execute the whole query and examine all the data. On the other hand, PostgreSQL's performance is acceptable given it has to count ten million rows by splitting the task among six workers operating in parallel.
I didn't expect TimescaleDB to outperform the others yet at this point, given the query's simplicity. However, CPU vectorization instructions operating on many values at once rather than processing them individually (like MySQL and PostgreSQL) resulted in a big speed boost. The runtime is nearly instant, thanks to pre-calculating results either completely (Continuous Aggregate) or by processing only the freshest data and merging it with pre-calculated results for older data (Realtime Continuous Aggregate).
SELECT TIME_BUCKET('1 hour', EventTime) AS hour, COUNT(*) FROM hits WHERE Tenant = 1 GROUP BY hour ORDER BY hour ASC;
-- Continuous Aggregate Definition:
-- (reuse of hits_hourly from before!)
-- Continuous Aggregate Query:
SELECT hour, cnt FROM hits_hourly WHERE Tenant = 1 ORDER BY hour ASCSELECT DATE_TRUNC('hour', EventTime) AS hour, COUNT(*) FROM hits WHERE Tenant = 1 GROUP BY hour ORDER BY hour ASCSELECT DATE_FORMAT(EventTime, '%Y-%m-%d %H:00:00') AS hour, COUNT(*) FROM hits WHERE Tenant = 1 GROUP BY hour ORDER BY hour ASC The previous query has been refined to calculate hourly counts, rather than the entire range. This should now test calculation overhead more accurately, going beyond just walking the index but its still an index-only query.
As expected, runtime increases as more work must be done - except for the Continuous Aggregates which pre-calculated everything. I don't know why the TimescaleDB columnstore query is slower than PostgreSQL. But hey, at least it's an honest benchmark showing that another database could be faster.
SELECT AVG(windowclientwidth) FROM hits WHERE Tenant = 1 and IsMobile = 0 and EventTime >= '2013-07-24 00:00:00';
-- Continuous Aggregate Definition:
CREATE MATERIALIZED VIEW hits_windowclientwidth WITH (timescaledb.continuous) AS
SELECT TIME_BUCKET('1 hour', EventTime) AS hour, Tenant, IsMobile, STATS_AGG(windowclientwidth) AS windowclientwidth
FROM hits
GROUP BY hour, Tenant, IsMobile;
-- Continuous Aggregate Query:
SELECT AVERAGE(ROLLUP(windowclientwidth)) FROM hits_windowclientwidth WHERE Tenant = 1 AND IsMobile = 0 AND hour >= '2013-07-24 00:00:00' SELECT AVG(windowclientwidth) FROM hits WHERE Tenant = 1 and IsMobile = 0 and EventTime >= '2013-07-24 00:00:00' This query is the first one needing an extra column from the table (windowclientwidth), while the columns for filtering (Tenant, IsMobile and EventTime) are covered by the index. It's also worth noting that the query is now only handling 1,797,511 rows because of the additional filters.
The runtime aligns with expectations based on the performance differences observed with the previous queries. However, I noticed that MySQL used an unfitting (and slower) index instead of the specialized one I created for this query. So I've forced it to use the correct one for these benchmark results which resulted in two seconds less runtime. This approach favors MySQL.
SELECT RefererCategoryID, Income, COUNT(*) FROM hits WHERE Tenant = 1 AND EventTime BETWEEN '2013-07-21 00:00:00' AND '2013-07-21 23:59:59' AND RefererCategoryID > 0 GROUP BY RefererCategoryID, Income ORDER BY COUNT(*) DESC LIMIT 50;
-- Continuous Aggregate Definition:
CREATE MATERIALIZED VIEW hits_referers WITH (timescaledb.continuous) AS
SELECT TIME_BUCKET('1 hour', EventTime) AS hour, Tenant, RefererCategoryID, Income, COUNT(*) AS cnt
FROM hits
GROUP BY hour, Tenant, RefererCategoryID, Income;
-- Continuous Aggregate Query:
SELECT RefererCategoryID, Income, SUM(cnt) FROM hits_referers WHERE Tenant = 1 AND hour BETWEEN '2013-07-21 00:00:00' AND '2013-07-21 23:59:59' AND RefererCategoryID > 0 GROUP BY RefererCategoryID, Income ORDER BY SUM(cnt) DESC LIMIT 50 SELECT RefererCategoryID, Income, COUNT(*) FROM hits WHERE Tenant = 1 AND EventTime BETWEEN '2013-07-21 00:00:00' AND '2013-07-21 23:59:59' AND RefererCategoryID > 0 GROUP BY RefererCategoryID, Income ORDER BY COUNT(*) DESC LIMIT 50 This query makes the database's job even harder, as it now has to group the rows, sort them, and pick the best ones. The index still handles the filtering, producing 1,122,373 rows, but the database must load the columns needed for grouping from the table.
Again, the performance is as expected by the previous queries. And again MySQL had to be forced to use the correct index. What are you doing MySQL? You had one job and you did it wrong... This approach favors MySQL.
SELECT RegionId, COUNT(DISTINCT DATE_TRUNC('day', EventTime)::text || UserID) AS users FROM hits WHERE Tenant = 1 AND EventTime BETWEEN '2013-07-01 00:00:00' AND '2013-07-05 23:59:59' GROUP BY RegionId ORDER BY users DESC LIMIT 10;
-- Continuous Aggregate Definition:
CREATE MATERIALIZED VIEW hits_regions WITH (timescaledb.continuous) AS
SELECT TIME_BUCKET('1 day', EventTime) AS day, Tenant, RegionID, COUNT(DISTINCT UserId) AS users
FROM hits
GROUP BY day, Tenant, RegionID;
-- Continuous Aggregate Query:
SELECT RegionID, SUM(users) AS users FROM hits_regions WHERE Tenant = 1 AND day BETWEEN '2013-07-01' AND '2013-07-05' GROUP BY RegionID ORDER BY users DESC LIMIT 10 SELECT RegionId, COUNT(DISTINCT DATE_TRUNC('day', EventTime)::text || UserID) AS users FROM hits WHERE Tenant = 1 AND EventTime BETWEEN '2013-07-01 00:00:00' AND '2013-07-05 23:59:59' GROUP BY RegionId ORDER BY users DESC LIMIT 10 SELECT RegionId, COUNT(DISTINCT CONCAT(DATE_FORMAT(EventTime, '%Y-%m-%d'), UserID)) AS users FROM hits WHERE Tenant = 1 AND EventTime BETWEEN '2013-07-01 00:00:00' AND '2013-07-05 23:59:59' GROUP BY RegionId ORDER BY users DESC LIMIT 10 This query is the most demanding one in the benchmark. The only way for a database to execute a COUNT DISTINCT query is to keep track of every value it finds to ignore duplicates. This type of query is very resource-intensive!
MySQL and PostgreSQL now perform similarly because the COUNT DISTINCT calculation is so resource-intensive that MySQL's overhead no longer makes a difference. And the TimescaleDB Columnstore is only 30% faster than both of them because the heavy COUNT DISTINCT computation reduced the columnstore performance gain significantly. Once again, pre-calculating results with Continuous Aggregates proves to be a good solution, as it shifts the slow query execution time to a one-time operation when the rows are pre-summarized.
When you combine the previous runs into one number, you'll see these impressive performance numbers. Keep in mind that the benchmark setup was intentionally configured to favor PostgreSQL and MySQL to be as fair as possible and all data was cached in memory! With disk access, the numbers for MySQL and PostgreSQL are much much higher and TimescaleDB would shine much more as it would only load the needed columns.
As always, please run your own benchmarks! But it's hard to beat TimescaleDB's (Realtime) Continuous Aggregates, since the heavy computations are already done and you get the results for now free. You can also try building this yourself with MySQL and PostgreSQL, but it's a lot of work that's better spent developing your application and hard to make performant if you can't hook deep into the database.