Add bounded unnest + ordered array_agg reproducer benchmark and SLT coverage#22075

Open
kosiew wants to merge 2 commits into apache:main from kosiew:memory_issue-01-20788

@kosiew
Contributor

@kosiew kosiew commented May 8, 2026

Which issue does this PR close?

Rationale for this change

This PR adds a compact and reproducible test shape for the reported high-memory query pattern involving:

  • list column expansion via unnest
  • row explosion
  • regrouping with GROUP BY
  • ordered aggregation using array_agg(... ORDER BY ...)

The goal is to isolate and document the execution shape before optimizer or executor fixes are introduced. The reproducer is intentionally bounded so it can run reliably in local and CI environments while still demonstrating the problematic expansion pattern.
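The query shape listed above can be modeled in a few lines of plain Python. This is only an illustrative sketch, not DataFusion code: the helper names (`make_rows`, `unnest`, `ordered_array_agg`) are hypothetical, and buffering every `(idx, val)` pair per group before sorting is an assumption used to show why the pattern can be memory-hungry, not a description of how `AggregateExec` is implemented.

```python
from collections import defaultdict


def make_rows(num_rows, list_len):
    # Each input row carries an id and a list column, loosely mirroring
    # lists built with range() in the reproducer.
    return [(row_id, list(range(list_len))) for row_id in range(num_rows)]


def unnest(rows):
    # Row explosion: emit one (id, idx, val) row per list element.
    return [
        (row_id, idx, val)
        for row_id, values in rows
        for idx, val in enumerate(values)
    ]


def ordered_array_agg(expanded):
    # Model of GROUP BY id with array_agg(val ORDER BY idx):
    # collect (idx, val) pairs per group, sort by idx, keep the values.
    groups = defaultdict(list)
    for row_id, idx, val in expanded:
        groups[row_id].append((idx, val))
    return {
        row_id: [val for _, val in sorted(pairs)]
        for row_id, pairs in groups.items()
    }


rows = make_rows(3, 4)
expanded = unnest(rows)
regrouped = ordered_array_agg(expanded)

print(len(expanded))   # 12 intermediate rows from 3 rows x 4 elements
print(regrouped[0])    # [0, 1, 2, 3]
```

In this model the intermediate result grows with rows × list length, while the final output has only as many groups as there were input rows.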

What changes are included in this PR?

  • Added a new benchmark:

    • benchmarks/sql_benchmarks/unnest_array_agg/benchmarks/q01.benchmark
  • Added SQLLogicTest coverage:

    • datafusion/sqllogictest/test_files/unnest_array_agg_repro.slt
  • Added a bounded synthetic workload that:

    • creates list columns using range
    • expands them with unnest
    • regroups rows using array_agg(val ORDER BY idx)
  • Added validation of the intermediate row expansion count.

  • Captured EXPLAIN VERBOSE output for the reproducer, including:

    • logical plan
    • initial physical plan
    • physical execution plan
    • schema details for ordered aggregate state
  • Added configurable benchmark scaling via:

    • UNNEST_ARRAY_AGG_ROWS
    • UNNEST_ARRAY_AGG_LIST_LEN
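
As a rough guide to what the two scale knobs imply, the sketch below estimates the intermediate row count for a given setting. It is an assumption-laden illustration: the function `expanded_row_count` is hypothetical, the fallback values of 2000 and 1000 are taken from the defaults stated in the commit message, and the way the benchmark file itself consumes these variables is not shown in this PR description.

```python
import os


def expanded_row_count(env=None):
    # Read the two knobs from the given mapping (or the process environment),
    # falling back to the defaults quoted in the commit message.
    env = os.environ if env is None else env
    rows = int(env.get("UNNEST_ARRAY_AGG_ROWS", "2000"))
    list_len = int(env.get("UNNEST_ARRAY_AGG_LIST_LEN", "1000"))
    # unnest emits one row per list element, so the counts multiply.
    return rows * list_len


# Defaults: 2000 rows x 1000 elements.
print(expanded_row_count({}))  # 2000000

# Small setting matching the 3 x 4 case used in the SLT.
small = {"UNNEST_ARRAY_AGG_ROWS": "3", "UNNEST_ARRAY_AGG_LIST_LEN": "4"}
print(expanded_row_count(small))  # 12
```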

Are these changes tested?

Yes.

This PR adds:

  • SQLLogicTest coverage in:

    • datafusion/sqllogictest/test_files/unnest_array_agg_repro.slt
  • A benchmark reproducer in:

    • benchmarks/sql_benchmarks/unnest_array_agg/benchmarks/q01.benchmark

The SLT verifies:

  • row expansion counts
  • ordered array_agg results
  • EXPLAIN VERBOSE plan shape including UnnestExec and AggregateExec

Are there any user-facing changes?

No user-facing changes. This PR only adds regression coverage and benchmarking infrastructure for a specific query shape.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew added 2 commits May 8, 2026 15:03
…lity

- Added a new SQL logic test for unnest_array_agg that includes a synthetic bounded repro
- Confirms that 3 rows with 4-element lists expand to 12 rows
- Validates regrouping of array_agg(val ORDER BY idx)
- Captures the EXPLAIN VERBOSE plan shape with UnnestExec and ordered AggregateExec
- Introduced a scalable benchmark for unnest_array_agg with defaults of 2000 rows and 1000 list elements per row
- Added environment scale knobs for customization:
  - UNNEST_ARRAY_AGG_ROWS
  - UNNEST_ARRAY_AGG_LIST_LEN
@github-actions github-actions Bot added the sqllogictest SQL Logic Tests (.slt) label May 8, 2026
@kosiew
Contributor Author

kosiew commented May 8, 2026

run benchmark sql

env:
  BENCH_NAME: unnest_array_agg
  BENCH_QUERY: 1
  UNNEST_ARRAY_AGG_ROWS: 3
  UNNEST_ARRAY_AGG_LIST_LEN: 4

@apache apache deleted a comment from adriangbot May 8, 2026
@apache apache deleted a comment from adriangbot May 8, 2026
@kosiew
Contributor Author

kosiew commented May 8, 2026

show benchmark queue

@adriangbot

Hi @kosiew, you asked to view the benchmark queue (#22075 (comment)).

| Comment | Repo | PR | User | Benchmarks | Status |
| --- | --- | --- | --- | --- | --- |
| #4404551613 | apache/datafusion | #22075 | kosiew | ["sql"] | running |

@apache apache deleted a comment from adriangbot May 8, 2026
@kosiew kosiew marked this pull request as ready for review May 8, 2026 07:34
@kosiew
Contributor Author

kosiew commented May 8, 2026

show benchmark queue

@adriangbot

Hi @kosiew, you asked to view the benchmark queue (#22075 (comment)).

No pending jobs.