Skip to content

[SQL] Unused field analysis for aggregates#5601

Merged
mihaibudiu merged 1 commit intomainfrom
issue5541
Feb 28, 2026
Merged

[SQL] Unused field analysis for aggregates#5601
mihaibudiu merged 1 commit intomainfrom
issue5541

Conversation

@mihaibudiu
Copy link
Contributor

Fixes #5541

@mihaibudiu mihaibudiu marked this pull request as draft February 11, 2026 05:25
@mihaibudiu
Copy link
Contributor Author

Interestingly, this pass actually discovers unused aggregate functions which are introduced by the Calcite decorrelator!

@mihaibudiu mihaibudiu marked this pull request as ready for review February 11, 2026 06:10
@mihaibudiu mihaibudiu marked this pull request as draft February 11, 2026 06:33
@mihaibudiu mihaibudiu marked this pull request as ready for review February 11, 2026 23:50
@mihaibudiu mihaibudiu force-pushed the issue5541 branch 2 times, most recently from 887e62d to 03616ff Compare February 11, 2026 23:56
@mihaibudiu mihaibudiu marked this pull request as draft February 12, 2026 00:01
@mihaibudiu
Copy link
Contributor Author

Calcite compiles the following query:

create materialized view q4
                as select
                        o_orderpriority,
                        count(*) as order_count
                from
                        orders
                where
                        o_orderdate >= date '1993-07-01'
                        and o_orderdate < date '1993-07-01' + interval '3' month
                        and exists (
                                select
                                        *
                                from
                                        lineitem
                                where
                                        l_orderkey = o_orderkey
                                        and l_commitdate < l_receiptdate
                        )
                group by
                        o_orderpriority

To a plan containing the following fragment:

              LogicalJoin(condition=[=($0, $9)], joinType=[inner]), id = 574
                LogicalFilter(condition=[SEARCH($4, Sarg[[1993-07-01..1993-10-01)])]), id = 552
                  LogicalTableScan(table=[[schema, orders]]), id = 68
                LogicalAggregate(group=[{0}], agg#0=[MIN($1)]), id = 559
                  LogicalProject(l_orderkey=[$0], $f0=[true]), id = 557
                    LogicalFilter(condition=[<($11, $12)]), id = 555
                      LogicalTableScan(table=[[schema, lineitem]]), id = 70

By using a MIN(true) aggregate to figure out whether the EXISTS subquery produces any rows. Turns out that the result of the MIN is never used, and this new optimization can actually completely remove it, leaving an aggregate... which does nothing. This is implemented much more efficiently by using a linear aggregate, which returns Tup0 if there is any value in the collection, or nothing otherwise.

@mihaibudiu mihaibudiu marked this pull request as ready for review February 12, 2026 00:51
@mihaibudiu
Copy link
Contributor Author

The analysis can also discover that the following SUM aggregate is not used, since it's used only for ORDER BY, which is ignored by default:

                select deptno
                from emp
                group by deptno
                order by sum(sal) filter (where job = 'CLERK')

@mihaibudiu
Copy link
Contributor Author

However, if an aggregate is evaluated just for side-effects (i.e., crash on overflow), removing aggregates is not sound.

Copy link
Collaborator

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two hard-block issues here, plus one clear test bug.

1. No manual testing description. The PR body contains only "Fixes #5541" with no description of what manual testing was done. Per the team's process, every PR must describe the manual testing steps taken.

2. Soundness concern is unhandled (see inline comment on processAggregate). You flagged this yourself in the PR comments — but the code doesn't address it. Until there is either a fix or an explicit decision that Feldera's semantics don't include overflow-crash guarantees (with that documented in a code comment), this is an open correctness concern that should block merge.

3. Copy-paste bug in issue5541b (see inline comment). The expected output uses column name a but the view produces min.

Minor: the commit message has no body. For a change of this scope — renaming OptimizeMapsOptimizeProjections, adding aggregate elimination, introducing empty-aggregate support — a body explaining the why (e.g., decorrelator-introduced MIN(true) aggregates, ORDER BY on unreferenced aggregates) would help future readers understand the change in isolation. Please use git rebase -i to add one before merge.

The optimization itself — particularly the "filler" MapIndex that restores the original output shape while skipping the removed computation, and the empty-aggregate → linear-aggregate path — is elegant. The fanout guards are correctly placed.

Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
Comment on lines +1338 to +1344
// Semigroup for the an aggregate which computes nothing
// Useful when the compiler removes all aggregates from an aggregate operator
// (This is currently never used, because the compiler generates a linear aggregate
// for this case)
#[derive(Clone)]
#[doc(hidden)]
pub struct EmptySemigroup;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add it if it will never be used?

But I do see a reference later on in the commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that code in the compiler is unreachable, but I tested it with this implementation.
I have spent time thinking about this, so I decided to keep the implementation - maybe it will be necessary in the future?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does no harm 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it does make compilation slower...

@mihaibudiu mihaibudiu added this pull request to the merge queue Feb 27, 2026
Merged via the queue into main with commit 4842324 Feb 28, 2026
9 checks passed
@mihaibudiu mihaibudiu deleted the issue5541 branch February 28, 2026 01:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[SQL] The unused fields analysis does not remove unused aggregates

3 participants