Recursive queries are a straightforward solution to querying hierarchical trees. However, one loop in the relationship references results in a failing or never ending query when cycle detection is not used.
For storing trees in a database the adjacency list model is a straightforward implementation, where every row contains a reference to the parent row. One recursive query can select all parents or children for a specific row instead of the most common implementation to execute multiple queries. However, accidental loops in the data's references would fail a recursive query when cycle detection is not used to skip these errors. The database can provide cycle detection, or the same logic is implemented manually with a JSON array accumulating the recursion's path to stop at already seen rows.
WITH RECURSIVE tree AS (
SELECT id, name, 1 AS level, JSON_ARRAY(id) AS path
FROM employees
WHERE manager_id = 1
UNION
SELECT employees.id, employees.name, tree.level + 1, JSON_ARRAY_APPEND(tree.path, '$', employees.id)
FROM tree
JOIN employees ON(tree.id = employees.manager_id)
WHERE NOT employees.id MEMBER OF(tree.p)
)
SELECT * FROM tree
WITH RECURSIVE tree AS (
SELECT id, name, 1 AS level
FROM employees
WHERE manager_id = 1
UNION
SELECT employees.id, employees.name, tree.level + 1
FROM tree
JOIN employees ON(tree.id = employees.manager_id)
) CYCLE id SET is_cycle USING cycle_path
SELECT * FROM tree
Trees can be stored in a database as e.g. nested sets or materialized paths.
However, the most popular one is the adjacency list model storing a
parent_id
reference within every record.
This approach has the most effortless implementation for saving the data as no complex transformations need to be done.
It is a very straightforward concept.
A naive implementation will select the data for one level and execute consecutive queries for every further tree level. For big trees, this will result in increased latency as another query needs to be executed after the last one has finished. However, a single recursive query can do that in a single step:
tree
is created using a recursive Common Table Expression (CTE) that accumulates the results.
As the CTE is recursive, the queries within the CTE can reference themselves.
SELECT id, name, 1 AS level FROM employees WHERE manager_id = 1
tree
CTE will be executed to find new rows from a different level of the tree incrementing the level value:
SELECT employees.id, employees.name, tree.level + 1 FROM tree JOIN employees ON(tree.id = employees.manager_id)
The adjacency list is a simple model which is easy to implement but has one serious drawback: The relationships should not have loops! In PostgreSQL such a query will run endlessly and never finish. However, limiting the rows read from the recursive CTE will automatically stop it when the threshold is reached. This workaround will not work in MySQL as the recursive CTE will first have to finish executing before the data is read from it and a LIMIT will be applied. Nevertheless, in MySQL any recursive query will automatically fail with an error after 1.000 recursive steps. This limit can be increased with the cte_max_recursion_depth system variable to search within deep trees.
These safety measures to limit the runtime of a query are lovely, but the optimal solution would be to stop the recursion when a loop is detected.
This feature is called “cycle detection” and is part of the SQL standard.
It is supported since PostgreSQL 14 and only needs a small extension to the CTE.
The
CYCLE
clause states that loops should be detected on the
id
column by automatically accumulating the tree path in the
cycle_path
` column and using it to calculate the exit condition
is_cycle
:
CYCLE id SET is_cycle USING cycle_path
.
The recursion will automatically stop for any detected loop.
With MySQL and older PostgreSQL versions automatic cycle-detection is not available. Fortunately, the `CYCLE` feature is just syntactic sugar for the manual approach that has been available for years:
SQL for Devs Founder
Be notified on future content. Never spam.