- Author(s): @guo-shaoge, @wjhuang2016
- Discussion PR: #24147
- Tracking Issue: #17472
- Introduction
- Motivation or Background
- Detailed Design
- Test Design
- Impacts & Risks
- Investigation & Alternatives
- Unresolved Questions
- Future Work
This proposal describes the basic implementation of Common Table Expression(CTE) using the Materialization
method.
Merge
is another way to implement CTE and will be supported later.
CTE was introduced in SQL:1999. It's a temporary result set which exists within a statement. You can reference it later in the statement. It's similar to derived tables in some way. But CTE has extra advantages over derived tables.
- CTE can be referenced multiple times.
- CTE is easier to read.
- You can use CTE to do hierarchical queries.
There are two kinds of CTE:
-
non-recursive CTE.
WITH cte1 AS (SELECT c1 FROM t1) cte2 AS (SELECT c2 FROM t2) SELECT cte1.c1, cte2.c2 FROM cte1, cte2 WHERE cte1.c1 = cte2.c2 AND cte1.c1 = 100;
-
recursive CTE, which can be used to do hierarchical queries. Recursive CTE normally consists of the seed part and the recursive part. Seed part will generate the origin data, the remaining computation will be done by the recursive part.
WITH RECURSIVE cte1 AS ( SELECT part, sub_part FROM t WHERE part = 'human' UNION ALL SELECT t.part, t.sub_part FROM t, cte1 WHERE cte1.sub_part = t.part ) SELECT * FROM cte1;
There are two ways to implement CTE:
- Merge: just like view, CTE is expanded on where it's used.
- Materialization: use temporary storage to store the result of CTE, and read the temporary storage when using CTE.
Merge
is normally a better way to implement CTE, because optimizer can pushdown predicates from outer query to inner query.
But when CTE is referenced multiple times, Materialization
will be better. And recursive CTE can only be implemented by Materialization
.
For simplicity, this design doc only uses Materialization
to implement both non-recursive and recursive CTE.
So there is no need to consider how to choose between these two methods, so as to reduce the chance of bugs.
And in the near future, Merge
will be definitely supported to implement non-recursive CTE.
RowContainer
will be used to store the materialized result.
The result will first reside in memory, and when the memory usage exceeds tidb_mem_quota_query
, intermediate results will be spilled to disk.
For recursive CTE, the optimizer has no way to accurately know the number of iterations. So we use the seed part's cost as CTE's cost.
Also we will add a system variable cte_max_recursion_depth
to control the max iteration number.
If the maximum iteration number is reached, an error will be returned.
A major advantage of Materialization
is that data will only be materialized once even if there are many references to the same CTE.
So we will add a map to record all CTE's subplan, these subplans will only be optimized and executed once.
type LogicalCTE struct {
logicalSchemaProducer
cte *CTEClass
cteAsName model.CIStr
}
type CTEClass struct {
IsDistinct bool
seedPartLogicalPlan LogicalPlan
recursivePartLogicalPlan LogicalPlan
IDForStorage int
LimitBeg uint64
LimitEnd uint64
}
LogicalCTE
is the logical operator for CTE. cte
is a member of CTEClass
type.
The content of CTEClass
will be the same if different CTE references point to the same CTE.
The meanings of main members in CTEClass are as follows:
IsDistinct
: True if UNION [DISTINCT] and false if UNION ALL.seedPartLogicalPlan
: Logical Plan of seed part.recursivePartLogicalPlan
: Logical Plan of recursive part. It will be nil if CTE is non-recursive.IDForStorage
: ID of intermediate storage.LimitBeg
andLimitEnd
: Useful when limit is used in recursive CTE.
type LogicalCTETable struct {
logicalSchemaProducer
name string
idForStorage int
}
LogicalCTETable
will read the temporary result set of CTE. idForStorage
points to the intermediate storage.
PhysicalCTE
and PhysicalCTETable
will be added corresponding to LogicalCTE
and LogicalCTETable
.
type CTEExec struct {
exec.BaseExecutor
seedExec Executor
recursiveExec Executor
resTbl cteutil.Storage
iterInTbl cteutil.Storage
iterOutTbl cteutil.Storage
hashTbl baseHashTable
}
CTEExec
will handle the main execution. Detailed execution process will be explained later.
seedExec
andrecursiveExec
: Executors of seed part and recursive part.resTbl
: The final output is stored in this storage.iterInTbl
: Input data of each iteration.iterOutTbl
: Output data of each iteration.hashTbl
: Data will be deduplicated if UNION [DISTINCT] is used.
type CTETableReaderExec struct {
exec.BaseExecutor
iterInTbl cteutil.Storage
chkIdx int
curIter int
}
CTETableReaderExec
is used in recursive CTE. It will read iterInTbl
and the output chunk will be processed in CTEExec
.
type Storage interface {
OpenAndRef() bool
DerefAndClose() error
Add(chk *chunk.Chunk) error
GetChunk(chkdIdx int) (*chunk.Chunk, error)
Lock()
Unlock()
}
Storage
is used to store the intermediate data of CTE. Check the resTbl
, iterInTbl
and iterOutTbl
in CTEExec
.
Since there will be multiple executors using one Storage
, we use a ref count to record how many users currently there are.
When the last user calls Close(), the Storage
will really be closed.
OpenAndRef()
: Open thisStorage
, if already opened, add ref count by one.DerefAndClose()
: Deref and check if ref count is zero, if true, the underlying storage will be truly closed.Add()
: Add chunk into storage.GetChunk()
: Get chunk by chunk id.Lock()
andUnlock
:Storage
may be used concurrently. So we need to add a lock.
In parsing phase, definition of CTE will be parsed as a subtree of the outermost select stmt.
The parsing phase will generate an AST, which will be used to generate LogicalCTE
. This stage will complete the following steps:
-
Distinguish between seed part and recursive part of the definition of CTE. And build logical plans for them.
-
Do some validation checks.
-
Mutual recursive(cte1 -> cte2 -> cte1) is not supported.
-
Column number of the seed part and the recursive part must be same.
-
All seed parts should follow recursive parts.
-
recursive parts cannot include:
ORDER BY
,Aggregate Function
,Window Function
,DISTINCT
.
-
-
Recognize the same CTE. If there are multiple references to the same CTE.
We use the following SQL to illustrate:
WITH RECURSIVE cte1 AS (SELECT c1 FROM t1 UNION ALL SELECT c1 FROM cte1 WHERE cte1.c1 < 10) SELECT * FROM t2 JOIN cte1;
The logical plan of above SQL will be like:
In this stage, the LogicalCTE
will be converted to PhysicalCTE
. We just convert logical plans of the seed part and the recursive part to its physical plan.
Also LogicalCTETable
will be converted to PhysicalCTETable
.
The Physical Plan will be like:
Three structures will be constructed:
CTEExec
: Evaluate seed part and recursive part iteratively.CTETableReaderExec
: Read result of previous iteration and return result to parent operator.Storage
: This is where the materialized results are stored.CTEExec
will write it andCTETableReaderExec
will read it.
The executor tree will be like:
The following pseudo code describes the execution of CTEExec
:
func (e *CTEExec) Next(req *Chunk) {
// 1. The first executed CTEExec will be responsible for filling storage.
e.storage.Lock()
defer e.storage.Unlock()
if !e.storage.Done() {
// 1.1 Compute seed part and store data into e.iterInTbl.
for {
// 1.2 Compute recursive part iteratively and break if reaches termination conditions.
}
e.storage.SetDone()
}
// 2. Return chunk in e.resTbl.
}
Only the first CTEExec
will fill the storage. Others will be waiting until the filling process is done.
So we use Done()
to check at the beginning of execution.
If the filling process is already done, we just read the chunk from storage.
The filling of Storage
is done by CTEExec
and CTETableReaderExec
together. The following figure describes the process.
-
Compute the
SeedExec
(step 1) and all output chunks will be stored intoiterInTbl
(step 2), which will be the input data of next iteration. -
Compute the
RecursiveExec
iteratively.iterInTbl
will be read byCTETableReaderExec
(step 3) and its output will be processed by executors inRecursiveExec
(step 4,5). And finally the data will be put intoiterOutTbl
(step 5). -
At the end of each iteration, we copy all data from
iterOutTbl
toiterInTbl
andresTbl
(step 7). So the output of the previous iteration will be the input of the next iteration. -
Iteration will be terminated if:
- No data is generated in this iteration or
- Iteration number reaches
@@cte_max_recursion_depth
or - Execution time reaches
@@max_execution_time
.
Data in storage will be spilled to disk if memory usage reaches @@tidb_mem_quota_query
. MemTracker
and RowContainer
will handle all the spilling process.
Also we use a hash table to de-duplicate data in resTbl
. Before copying data from iterOutTbl
to resTbl
, we use this hash table to check if there are duplications.
We also support use LIMIT in recursive CTE. By using LIMIT, users don’t have to worry about infinite recursion. Because if the number of output rows reaches the LIMIT requirement, the iteration will be terminated early.
- Basic usage of non-recursive and recursive CTE.
- Define a new CTE within a CTE.
- Use CTE in subquery.
- Use CTE in UPDATE/DELETE/INSERT statements.
- CTE name conflicts with other table names.
- Join CTE with other tables/CTE.
- Use expressions with CTE.
We should test CTE used together with other features:
- Use CTE with PREPARE/EXECUTE/PlanCache.
- Use CTE with a partition table.
- Stale read.
- Clustered index.
- SPM/Hint/Binding.
None
A basic performance test should be given in a specific scenario. The memory usage, disk usage and QPS should be reported.
CTE is a new feature and it will not affect the overall performance.
But for now we only use Materialization
to implement non-recursive CTE.
In some scenarios, the performance may not be as good as implementations which use Merge
.
Most mainstream DBMS use Merge
and Materialization
to implement non-recursive cte, while recursive cte can only be implemented with Materialization
.
Merge
is preferred when there are no side effects, such as random()/cur_timestamp() is used in subquery.
But when CTE is referenced multiple times, Materialization
may be better.
Also users can control which method to use explicitly by using hints.
For Materialization
, most DBMS use some kind of container to store materialized result,
which will be spilled to disk if the size is too large. The computation steps for recursive CTE are all similar.
But different system use different ways to try to optimize:
- Try to postpone materialization.
- Pushdown predicates to reduce the size of the temporary table.
None
- Support
Merge
and related hints(merge/no_merge). - Optimize
Materialization
, pushdown predicates to the materialized table. - MPP support.
CTEExec
will be implemented in TiFlash later. But SQL used in CTE definition still can be pushed to TiFlash.