Skip to content

DNM: envd scalability work#36634

Draft
aljoscha wants to merge 88 commits into
MaterializeInc:mainfrom
aljoscha:envd-ddl-scalability
Draft

DNM: envd scalability work#36634
aljoscha wants to merge 88 commits into
MaterializeInc:mainfrom
aljoscha:envd-ddl-scalability

Conversation

@aljoscha
Copy link
Copy Markdown
Contributor

For running spec sheet

aljoscha and others added 30 commits May 13, 2026 11:30
The existing scenarios scale cluster size or envd CPU cores -- nothing
measures how adapter/envd latency moves as the catalog itself grows. Add
two scenarios under a new `envd_scalability` group that fix the
measurement cluster and vary the number of catalog objects.

`envd_scalability_tables` puts N empty tables in the catalog -- pure
catalog/adapter pressure, no controller load. `envd_scalability_mvs`
does N materialized views over a single 1-row base table -- same
catalog footprint, plus controller load proportional to N. The MV
scenario shards across single-replica pad clusters at 10000 MVs per
cluster (so 100k MVs spans 10 clusters), since one cluster can't
reasonably host that many dataflows.

For each N in {1, 10, 100, 1k, 3k, 5k, 10k, 20k, 30k, 50k, 100k} we run
10 reps each of `CREATE TABLE` (DDL through the coordinator) and
`SELECT * FROM <1-row table>` (a simple peek on a fixed 100cc cluster).
The catalog is built incrementally across size points, so going from
N=k to the next size point only adds (next - k) objects -- otherwise
we'd pay an O(sizes * N) build cost. The size list is overridable via
`--envd-scalability-sizes` for scaffolding runs.

Results land in a third CSV (`*.envd_scalability.csv`) reusing the
cluster CSV schema; `mode='envd_scalability'` distinguishes the rows.
Test analytics rides on the existing `cluster_spec_sheet_result` table
-- no schema change needed. The analyzer plots `time_ms` vs N per
(scenario, category, test_name).

This is going to be long-running, especially the MV scenario where each
create exercises the controller -- expect hours for the full size
range.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add two new scenarios -- cluster_object_limits_indexes and
cluster_object_limits_mvs -- that find, per cluster size, the maximum
number of idle materializations one cluster can keep fresh.

The materializations are derived from a one-row, never-updated base
table so the only work the cluster has to do is keep advancing each
materialization's write_frontier in step with the upstream table. Once
the cluster can't keep up, freshness collapses; the driver records the
largest N at which `max(local_lag) < 2s` was still achievable, with the
unhealthy data point recorded too so the cliff is visible.

Staging-only (rejects --target=cloud-production), to avoid burning
production resources on long object-limit searches.
…lability default at 50k

When a materialization stalls completely (write_frontier never advances
past the minimum timestamp), `mz_internal.mz_materialization_lag` reports
`now() - 0` = current unix time in ms (~1.78e12). Recorded as-is this
crushes every healthy data point to ~0 on the plot. Cap the recorded
value at 10x the healthy threshold (= 20 s), preserve the underlying
truth via the `healthy` column, and label the plot to make the cap and
healthy threshold explicit.

Also drop 100_000 from the envd_scalability default size list: 50_000 is
a more sensible default ceiling for staging. The full size list is still
override-able via --envd-scalability-sizes for ad-hoc runs.
…tion

The release-qualification pipeline already runs three cluster-spec-sheet
groups (cluster_compute on production, source_ingestion on production,
environmentd on staging). Add two more groups -- envd_scalability and
cluster_object_limits -- both running against staging, since both push
the catalog / cluster to limits we don't want to exercise on production.
The three "envd / cluster" groups in the cluster-spec-sheet were named
inconsistently. Settle on the three concept names the cluster-spec-sheet
effort uses verbally:

  environmentd          -> envd_qps_scalability     (QPS vs envd CPU)
  envd_scalability      -> envd_objects_scalability (latency vs catalog N)
  cluster_object_limits -> cluster_object_limits    (unchanged)

Renames apply to: scenario constants, scenario-name string values, group
keys in SCENARIO_GROUPS, class names, the run/analyze function names,
the --envd-scalability-sizes CLI flag, the result CSV suffix, and the
`mode` field written into CSV rows. The pre-existing QPS scenarios keep
their individual `*_envd_strong_scaling` names since only the group is
renamed.

Also updates the release-qualification pipeline step ids/args and the
README to match.
…w start

When debugging cluster-spec-sheet runs on staging it's hard to tell which
environment we're actually talking to and whether the system parameter
defaults we expect (lifted via LaunchDarkly or similar) are actually
applied. Add a one-shot diagnostic right after target.initialize() that
prints mz_environment_id() and SHOWs the limits the test depends on
(max_tables, max_materialized_views, max_objects_per_schema, max_clusters,
max_credit_consumption_rate, memory_limiter_interval).

Best-effort: any probe error is logged and swallowed so a transient
failure does not abort the workflow.
psycopg3's execute() requires a LiteralString, so the f-string SHOW
query tripped pyright in CI. Compose the statement with
psycopg.sql.SQL/Identifier instead, matching the pattern already used in
test/orchestratord/mzcompose.py.
A staging run of `envd_objects_scalability_mvs` (release-qualification
build 1219) aborted at N≈19800 with:

    Retryable error: consuming input failed: SSL error: unexpected eof
    while reading, reconnecting...
    psycopg.errors.InternalError_: materialized view
    "materialize.pad_schema.pad_mv_19805" already exists

The TLS connection dropped mid-statement; envd had already committed the
CREATE but the response was lost. ConnectionHandler.retryable reconnects
and replays the same statement, which then fails with "already exists".

Use ``CREATE ... IF NOT EXISTS`` for every CREATE issued via _bulk_run so
the retry is a no-op. Affects the bulk-creation paths in both
envd_objects_scalability scenarios (tables, MVs) and both
cluster_object_limits scenarios (indexed views, MVs). Add a docstring on
_bulk_run spelling out the idempotency requirement so future CREATEs
don't reintroduce the hazard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 50k scale point pushes a single Envd Objects Scalability run past
the 13-hour mark on staging — adapter latency degrades so much by then
that each measurement repetition takes several seconds, and the catalog
build itself runs at <1/s. 30k is where the interesting signal already
lives. Drop 50k from the default list; ad-hoc runs that want it can
still pass --envd-objects-scalability-sizes explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cluster_object_limits N-list defaults to a +1k linear step past
N=1000, which is too coarse: a run on staging showed the cliff sits in
(1000, 2000] for cluster_object_limits_indexes across every cluster
size 100cc..1600cc, and we can't tell from that whether the limit is
1100 or 1900.

After the coarse N-walk hits its first unhealthy point, bisect the
(last_healthy, first_unhealthy) interval --cluster-object-limits-bisect-
steps times (default 4) and probe each midpoint. The bisection step adds
or drops objects in place — never rebuilds the catalog — so the cost is
only ~bisect_steps extra hydrate-and-probe rounds per cluster size. With
the default 4 steps, the cliff narrows to ±~60 objects.

Adds:
- `remove_objects(target_n)` symmetric to `add_objects(target_n)` on
  both ClusterObjectLimitsScenario subclasses. Indexes scenario drops
  via DROP VIEW ... CASCADE (cascades to the default index); MVs
  scenario drops via DROP MATERIALIZED VIEW.
- `--cluster-object-limits-bisect-steps` CLI flag plumbed through to
  `run_scenario_cluster_object_limits`.
- Bisection block in the per-cluster-size loop that calls add+remove
  (one is a no-op) and records each probe under the same CSV schema, so
  the existing freshness-lag-vs-N plot just gets denser near the cliff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If CREATE CLUSTER fails for a cluster_object_limits size — because the
target region doesn't expose that replica size, or because allocating
the cluster would exceed max_credit_consumption_rate — today the
scenario either aborts with a traceback or (when the cluster is created
but then can't actually keep up) reports a confusing "unhealthy at
N=100" data point.

Catch psycopg.errors.DatabaseError around the CREATE CLUSTER, log a
clear "size unavailable" line (with the underlying error class +
message), and `continue` to the next cluster size. OperationalError is
re-raised so that genuine connection failures (which run_query's retry
loop has already given up on) aren't silently masked as a size problem.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standalone Python harness that connects to a running environmentd, pads
the catalog with N objects of a chosen type (tables / views / mvs /
indexes), and times CREATE / DROP / ALTER / RENAME of various object
types at each scale point. Captures trace IDs from
emit_trace_id_notice so spans can be fetched from Tempo afterwards.

This is intended for profiling iteration, not CI tracking; the canonical
scaling numbers continue to live in test/cluster-spec-sheet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Replace CREATE CLUSTER IF NOT EXISTS (unsupported) with drop-then-create
  in both pad and measurement state setup.
- Default --pad-cluster-size and --meas-cluster-size to scale=1,workers=1
  (the smallest size accepted by a local envd; the 50cc cloud size only
  exists in cloud sizing maps).
- Add summarize_traces.py: bulk-fetch traces from Tempo and aggregate
  self-time per span name across a sampled (padding, scale, op) cell.
- NOTES.md: log two O(n) suspects identified by source-reading before
  any traces were captured:
    1. Transaction::allocate_oids walks all databases/schemas/roles/
       items/introspection_sources on every OID allocation — the comment
       at the function literally says "If DDL starts slowing down,
       this is a good place to try and optimize."
    2. Coordinator::validate_resource_limits does 5+ full walks of
       entry_by_id per CREATE (user_tables / user_sources / user_sinks /
       user_materialized_views / user_connections) — each is a filter
       over the whole catalog with no per-type bucket.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tables-padding audit at N in {0,500,2000,5000}: all CREATE/DROP/ALTER/
RENAME ops scale ~4 μs/object regardless of op, which says the bottleneck
is shared (catalog/coordinator) infrastructure, not op-specific.

Top per-N growers from the trace summary:
- storage::create_collections: +12.6 ms / 5000 = 2.5 μs/obj  (CREATE TABLE only)
- snapshot + transaction + consolidate (catalog durable): +7 ms / 5000
  = 1.4 μs/obj  (all ops)
- PersistTableWriteCmd::Append: +1.7-1.9 ms / 5000  (all ops)
- apply_catalog_implications_inner: ~1.1 ms appears at high N
- apply_updates: ~0.5-0.7 ms growth

Adds a third O(N) suspect to NOTES.md: TableTransaction::insert/update
both call for_values to scan all rows on every mutation — every catalog
durable txn insert walks the whole table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
views: half the slope of tables padding (+8 ms vs +23 ms at N=5000).
Tells us the per-N cost stacks two components:
- ~half scales with storage-collection count (paid only by ops that
  create/drop storage collections)
- ~half scales with catalog-entry count (paid by all ops)

mvs: super-linear, possibly quadratic. With 2000 MVs sharing a common
base table, CREATE TABLE jumps from 37 ms (N=0) to 2095 ms (N=2000) —
14x slowdown for 4x N.

Trace for the 2095 ms CREATE TABLE attributes 1.65 s of self-time to
storage::create_collections, none of which is in any instrumented
sub-span. Source-reading points at update_read_capabilities_inner
walking a MutableAntichain that accumulates one entry per MV holding a
read hold on the shared dependency, but we need targeted instrumentation
or a CPU profile to confirm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add #[instrument] / info_span! wrappers inside
storage_collections::create_collections_for_bootstrap to break down the
self-time that currently lands in storage::create_collections without
any sub-span:

- ccfb::open_persist_client: persist client open
- ccfb::open_data_handles_concurrent: buffer_unordered open_data_handles
- ccfb::install_collection_states: main per-collection install loop
- ccfb::synchronize_finalized_shards: post-loop finalize sync
- install_collection_dependency_read_holds_inner
- update_read_capabilities_inner
- acquire_read_holds
- append_shard_mappings (storage-controller side)

Lets us pinpoint which sub-phase is responsible for the super-linear
blow-up observed when N materialized views share a base table (CREATE
TABLE goes from 37 ms at N=0 to 2 s at N=2000 mvs).

Also bump Tempo's max_traces_per_user to 200k. Default 10k was rejecting
trace exports from envd during these audits, dropping the very traces we
need to analyse.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adding sub-spans inside open_data_handles (odh::upgrade_version,
odh::open_critical_handle, odh::open_write_handle, odh::fetch_recent_upper)
shows that the super-linear blow-up under MV-shared-dependency padding
is entirely in persist's per-shard state initialization path:

at N=1000 MVs, for one CREATE TABLE:
  ccfb::open_data_handles_concurrent: 119 ms
    odh::upgrade_version:               41 ms (persist make_machine + cas)
    odh::open_critical_handle:          33 ms (persist make_machine)
    odh::fetch_recent_upper:             7 ms
    odh::open_write_handle:           <1  ms

Both upgrade_version and open_critical_handle bottom out in
StateCache::get -> Applier::new -> maybe_init_shard, which hits
CockroachDB via the persist consensus table. Per-shard queries are
indexed by primary key, but observed cost grows with the total number
of existing shards in CRDB. NOTES.md records the chain of suspicion
and concrete next steps for verification (samply, raw CRDB timing,
contention isolation). This is persist+CRDB territory, separable from
the catalog-side O(N) fixes already proposed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Coordinator::validate_resource_limits runs before every DDL and was
making 5+ full walks of state.entry_by_id to count user items by type
(tables, sources, sinks, materialized views, secrets, connections by
sub-kind). At N catalog items each walk is O(N), so a DDL with N=5000
items burned ~25k OrdMap iterations of dead work.

Add three derived fields to CatalogState, maintained alongside
entry_by_id in insert_entry / drop_item:

  user_item_counts:       imbl::OrdMap<CatalogItemType, usize>
  user_source_shard_count: usize  (sum of user_controllable_persist_shard_count)
  user_connection_counts:  imbl::OrdMap<UserConnectionKind, usize>

UserConnectionKind mirrors the ConnectionDetails variants that the
limits care about (Kafka/Postgres/MySql/SqlServer/AwsPrivatelink) with
an Other bucket for the rest. All fields use imbl::OrdMap to keep
CatalogState's cheap Cow-clone semantics.

Bootstrap, normal apply, and the retract+addition update path all go
through insert_entry / drop_item, so the buckets stay in sync without
extra wiring. Existing iterator methods (user_tables(), ...) are left
in place for other callers.

Rewrites the connection sub-type loop and four .count() calls in
validate_resource_limits to use the new O(log N) getters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`Transaction::allocate_oids` previously walked every row in the five
OID-bearing tables (databases, schemas, roles, items,
introspection_sources) on every single OID allocation, paying O(N) in
the size of the catalog. At N=5000 items this is ~50 μs of pure compute
per DDL — visible in DDL latency traces, with the original code's own
comment flagging this exact spot as the place to optimize once DDL
starts slowing down.

Cache the set of OIDs from the *initial* snapshot in a new
`Transaction::initial_oids: BTreeSet<u32>`, populated once in
`Transaction::new`. On each `allocate_oids` call, fold in the pending
changes for those five tables by walking only their `pending`
BTreeMaps (small in practice), computing per-key OID adds and removes
relative to the initial value. The integer scan loop then checks
membership in O(log N) per probe instead of O(N) per call.

`initial_oids` is not serialized in `current_snapshot`; it is rebuilt
when a transaction is restored via `Transaction::new`, since the
snapshot already reflects the merged initial+pending state at the
restore point.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`TableTransaction::insert` previously called `for_values` for every
insertion, walking every initial+pending row of the table to (a) check
for a duplicate key and (b) run the optional uniqueness predicate. At
N existing rows this made each insert O(N), and was responsible for a
visible slice of the per-object cost on the DDL hot path observed
under envd-ddl-scalability.

Replace the duplicate-key scan with `self.get(&k).is_some()`, which
consolidates `initial`+`pending` for one key in O(log N + P_k). When
no uniqueness predicate is registered on the table (the case for the
majority of TableTransaction instances — 14 of the 22 constructions
use `TableTransaction::new`), skip the for_values walk entirely. When
one *is* registered, fall back to the existing full scan, with a code
comment pointing at the eventual fix (a uniqueness-key extractor + side
index).

Behaviour is preserved: the same inputs produce the same
`DurableCatalogError::DuplicateKey` / `UniquenessViolation` errors. The
only externally visible nuance is that when an insert would have hit
both conditions simultaneously, we now consistently surface
`DuplicateKey` first — the old code's precedence was iteration-order
dependent.

The `soft_assert_no_log!(self.verify().is_ok())` at the end of
`insert` is itself O(N) (and O(N^2) when the value type doesn't
implement `HAS_UNIQUE_NAME`), but it's gated behind
`soft_assertions_enabled()` and so is debug/test only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tables-padding: catalog-side O(N) slope eliminated. CREATE TABLE / DROP
TABLE / ALTER / RENAME / CREATE VIEW / DROP VIEW all show ~0 ms Delta
from N=0 to N=5000 (previously +17-23 ms each).

MVs-padding: catalog-side cost reduced ~2-3x (CREATE TABLE @n=2000:
2095 -> 851 ms) but super-linear remainder is intact, matching the
prediction that persist's per-shard init path dominates when many
storage collections share dependencies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first post-fix run compared a cold (freshly --reset) envd against
a warm pre-fix run, which made the N=0 baseline look 30 ms higher and
the slope look flat. Re-ran on a properly-warm envd: pre/post slopes
are essentially identical (~+20 ms from N=0 to N=5000 across all
ops). The three fixes were correct but their combined contribution to
the per-DDL slope is below measurement noise.

So the named O(N) loops (validate_resource_limits, allocate_oids,
TableTransaction::insert) are NOT the dominant source of the ~4 μs/
object slope. NOTES.md now lists the remaining suspects we hadn't
pinpointed (catalog durable snapshot, compare_and_append on the
catalog shard, apply_catalog_implications_inner).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fresh tables-padding audit on warm post-fix envd (N=0, N=5000, reps=8)
shows the dominant per-DDL O(N) cost is in catalog durable transaction
machinery, not in the named loops the three earlier fixes targeted:

  span                              N=0      N=5000   Δ
  snapshot (catalog durable)        509μs    4.1ms    +3.6ms
  transaction (catalog durable)     426μs    3.0ms    +2.6ms
  consolidate (catalog durable)     449μs    2.1ms    +1.7ms
  group_commit_apply::append_fut    5.0ms    7.6ms    +2.6ms
  apply_catalog_implications_inner  <0.5ms   1.2ms    +1.2ms
  apply_updates                     536μs    1.4ms    +0.9ms

snapshot + transaction + consolidate together account for ~1.5 μs/object
of the slope. Code locations:

  with_snapshot   src/catalog/src/durable/persist.rs:766
  Transaction::new src/catalog/src/durable/transaction.rs:128
  consolidate     src/catalog/src/durable/persist.rs:706

with_snapshot rebuilds a Snapshot { databases, schemas, items, ... } of
all BTreeMaps from scratch per transaction; Transaction::new then walks
every row again to do proto→Rust conversion into TableTransaction.initial.
Three full O(N) walks per DDL. This is exactly what the proposed
DurableCatalogData (Arc<imbl::OrdMap> per table) + per-txn overlay
design eliminates.
Spec for making single-statement DDL latency flat in catalog size, by
replacing per-transaction Snapshot materialization with shared, indexed
DurableCatalogData (Arc<imbl::OrdMap> per table) and a per-transaction
overlay. Includes audit findings from the envd-ddl-scalability branch
that motivate each part of the design, and a 7-step rollout plan with
per-step regression signals on the audit harness.
Step 1 of the O(delta) catalog transaction design
(`doc/developer/design/20260515_ddl_catalog_o_delta.md`).

Today every single-statement DDL re-materialises the full catalog twice
per transaction: `with_snapshot` walks the consolidated trace into a
fresh proto-typed `Snapshot { databases: BTreeMap, ... }`, and
`Transaction::new` walks every row of that snapshot again to do a
proto->Rust conversion into per-table `BTreeMap`s on the
`TableTransaction`. With N=5000 catalog items the `snapshot` and
`transaction` durable spans together add ~7 ms per DDL.

Introduce `DurableCatalogData`: one `Arc<imbl::OrdMap<K, V>>` per
durable-catalog table, holding Rust-typed values. It is maintained
incrementally by `CatalogStateInner::apply_update` — each persisted
state update inserts or removes from the matching map through
`Arc::make_mut`, so transactions holding older `Arc`s observe a frozen
view.

Rewrite `TableTransaction` to overlay-style: `base:
Arc<imbl::OrdMap<K, V>>` plus the existing `pending: BTreeMap<K,
Vec<TransactionUpdate<V>>>`. Reads probe `pending` then fall through to
`base`; writes only touch `pending`. Constructing one is now an
`Arc::clone`, no proto conversion.

`PersistCatalogState::transaction()` no longer calls `with_snapshot`.
It clones the shared `DurableCatalogData` (one `Arc::clone` per field)
and hands it to a new `Transaction::new_from_durable_data`. The old
`Transaction::new(_, Snapshot, _)` remains for `transaction_from_snapshot`
and external callers; it converts the proto-typed Snapshot to a
`DurableCatalogData` and delegates to the new constructor, paying the
O(N) proto->Rust cost only for those paths.

The trace `PersistHandle::snapshot: Vec<...>` is kept as-is — other
callers (`with_trace`, `get_next_id`, `Trace::from_snapshot`,
`persist_snapshot`) still depend on it. `DurableCatalogData` is
maintained alongside, not in place of, the trace.

Indexes, resource counts, shared OID sets, and the transaction
ownership rework are subsequent steps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 2 of the O(delta) catalog transaction design
(`doc/developer/design/20260515_ddl_catalog_o_delta.md`).

Add `DurableCatalogIndexes` to `DurableCatalogData`:

- `database_by_name`, `schema_by_parent_name`, `role_by_name`,
  `cluster_by_name`, `replica_by_cluster_and_name`,
  `network_policy_by_name`: one `Arc<imbl::OrdMap<Name, Key>>` per
  table that has a name-based uniqueness predicate.
- `item_by_namespace: Arc<imbl::OrdMap<(SchemaId, String),
  imbl::Vector<(ItemKey, CatalogItemType)>>>` — the items table's
  uniqueness predicate is composite (a Type item co-exists with a
  non-conflicting Func item at the same `(schema, name)`), so the
  bucket is a small list of candidates, walked pairwise.
- `oid_owner: Arc<imbl::OrdMap<u32, OidOwner>>` covers all five
  OID-bearing tables (databases, schemas, roles, items, intro
  sources). Not wired in yet — step 4 replaces the per-txn
  `initial_oids` walk with this.

The indexes are maintained in `CatalogStateInner::apply_update`,
in lock-step with the durable maps, via `Arc::make_mut`.

Plumb a per-`TableTransaction` `index_probe: Option<Box<dyn Fn(&V) ->
Vec<K>>>` so `insert` and `verify_keys` can replace the O(N)
`for_values` walk over `base` with an O(log N + collisions) index
probe plus a small walk over `pending`. The fallback nested-loop scan
stays for tables with a uniqueness predicate but no index. Wire all
seven indexed tables in `Transaction::new_from_durable_data` via
`Arc::clone`s of the index maps.

`snapshot_to_durable_data` builds the indexes once from the converted
tables — paid by callers that go through proto-`Snapshot`
(`transaction_from_snapshot`, tests), not the hot `transaction()` path.

Two test items (`test_persist_items`,
`test_persist_ddl_detection_with_batch_allocated_ids`) hard-coded OIDs
in the 20_000 range that collide with bootstrap-assigned OIDs
(materialize db, public schema). The new strict `oid_owner` insert
assertion correctly rejects them, so bump test OIDs to 30_000+ rather
than weaken the invariant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 4 of the O(delta) catalog transaction design.

`Transaction::new_from_durable_data` was still building a per-txn
`initial_oids: BTreeSet<u32>` by walking every value in the five
OID-bearing tables (databases, schemas, roles, items, intro sources).
That's residual O(N) work in the otherwise O(delta) hot path.

Replace `initial_oids: BTreeSet<u32>` with `initial_oid_owner:
Arc<imbl::OrdMap<u32, OidOwner>>`, populated by `Arc::clone`ing the
shared `oid_owner` index that step 2 added. The probe in
`allocate_oids::is_allocated` becomes `initial_oid_owner.contains_key`
— O(log N) point lookup instead of O(1) on a freshly-built set, but
the build cost drops from O(N) to O(1) per transaction.

`pending_added` / `pending_removed` still cover this transaction's
in-flight allocations / removals; only the initial-state view changes
representation. No behaviour change for callers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… stmts

Step 5 of the O(delta) catalog transaction design.

Two coupled changes.

(A) Drop the `'a` borrow on `Transaction`.

`Transaction<'a>` held `&'a mut dyn DurableCatalogState` only to:
- read `is_bootstrap_complete()` in two debug asserts,
- read `is_savepoint()` for `Transaction::is_savepoint()`,
- hand the reference back out of `into_parts` so `commit_internal`
  could call `commit_transaction`.

Capture the two booleans at construction. Make `commit` and
`commit_internal` take `storage: &mut dyn DurableCatalogState` as an
argument. `Transaction` becomes owned (no lifetime), so it can be
moved into `TransactionOps::DDL` and held across `await` points
without a borrow on the catalog storage.

`Transaction::new` / `new_from_durable_data` now take the two
booleans directly. The new
`DurableCatalogState::transaction_from_durable_data(data)` method
mirrors `transaction_from_snapshot(snapshot)` but skips the O(N)
proto→Rust conversion.

`StorageTxn for Transaction` (no lifetime) follows trivially.

`Transaction::current_durable_data(&self)` is the new
proto-conversion-free counterpart of `current_snapshot()`. Each
`TableTransaction` folds its `pending` into its `base` (`Arc::clone`
when pending is empty), and the indexes are rebuilt with
`DurableCatalogIndexes::from_tables`.

(B) Carry `DurableCatalogData` in `TransactionOps::DDL`.

The dry-run loop in `transact_incremental_dry_run` previously
checkpointed state as a proto `Snapshot` and called
`transaction_from_snapshot(prev_snapshot)` on the next statement —
an O(N) walk per statement of a multi-statement DDL transaction.

Replace `TransactionOps::DDL { snapshot: Option<Snapshot> }` with
`durable_data: Option<DurableCatalogData>`, and route through
`transaction_from_durable_data` / `current_durable_data`. Each
statement is now O(|pending|) instead of O(catalog).

The proto `Snapshot` type and the trait methods that use it remain;
they're public surface used by `catalog/tests/*` and any external
consumer of the `DurableCatalogState` trait.

Mechanical sweep through `src/adapter/src/catalog/{open,migrate,
open/builtin_schema_migration}.rs`,
`src/adapter/src/catalog/transact.rs`,
`src/adapter/src/coord/{ddl,command_handler,sequencer/inner}.rs`
to drop the `'_` lifetime from `Transaction<'_>` everywhere and to
thread `&mut **storage` (or `&mut *state` in catalog tests) into
`commit` / `commit_internal` calls.

`DurableCatalogState::allocate_id` was a default method that called
`txn.commit_internal(commit_ts)` against `&mut self`, which can't be
unsized to `&mut dyn DurableCatalogState` without making the method
non-object-safe. Rewrite the body to use `into_parts` +
`commit_transaction` directly. `allocate_id` only writes a single
`id_allocator` row, so skipping `commit_internal`'s in-memory
consolidation pass is a no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ications

Step 6 of the O(delta) catalog transaction design.

The "post-fix high-N trace ranking" in
test/envd-ddl-scalability/NOTES.md identified
`apply_catalog_implications_inner` as a ~+1.2 ms grower at N=5000,
shared by every DDL op type. The culprit is here:

The per-timeline loop in `gather_timeline_associations` walked every
read hold in `read_holds.storage_ids()` / `.compute_ids()` and asked
whether each was a member of the (typically tiny) drop set —
effectively O(N) per timeline. With ~5000 storage holds in the
default `EpochMilliseconds` timeline, every DDL paid this scan,
regardless of whether it dropped anything.

Two follow-on costs compounded it:
- `compute_gids_to_drop` was a `Vec`, so `.contains((id, gid))` was
  O(M) per element — O(N·M) inside the loop.
- The emptiness check then rebuilt `read_holds.id_bundle()` and called
  `.difference(&id_bundle).is_empty()` — another full O(N) walk per
  timeline.

Fix: probe the small drop sets into the BTreeMap-backed
`storage_holds` / `compute_holds`, making each scan O(|drop_set| · log N).
Convert `compute_gids_to_drop` to a `BTreeSet`. Replace the
`difference` emptiness check with a length comparison: id_bundle is
built from elements that exist in read_holds, so
`|read_holds \ id_bundle| == 0` iff sizes match — O(1) given map sizes.

Whole-cluster drops are rare and not naturally indexed; keep that
case's linear scan but gate it on `clusters_to_drop_set.is_empty()`.

`apply_updates` was audited end-to-end; no clear single O(N) loop in
the per-update handlers — the remaining ~+0.9 ms in that span is
likely tracing overhead and small allocation pressure across the
debug-instrumented apply path, not a fixable hot loop. Left for
follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aljoscha and others added 29 commits May 18, 2026 21:29
…cute is now flat

Validation bench at N=5k/10k/15k after 5d2d138 (remove O(n) table
advancement loop from group_commit). create_p50 dropped by 5.6 / 9.5 /
14.3 ms across the three scales; per-+5k slope reduction is 38% / 29%.

coord_builtin_table_execute itself is now essentially flat: mean per
call goes 3.80 -> 3.91 -> 4.59 ms (was 8.26 -> 12.40 -> 16.85 pre-fix).
Slope per +5k inside the timer drops from +4.14 / +4.45 ms to
+0.11 / +0.68 ms.

The remaining ~+11 ms-per-+5k headline slope is now spread across
transact_inner, tx_commit, and apply_catalog_implications. Largest
single absolute cost per DDL is apply_catalog_implications at ~11.6 ms
(flat). That's the next investigation target: a sub-phase split to
attribute where the constant 25%-of-DDL cost lives.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a new HistogramVec `mz_apply_catalog_implications_phase_seconds`
labeled by phase. apply_catalog_implications had only a single outer
timer (~11.6 ms per call regardless of N) so we could not tell what
fraction of that constant cost lives in each region.

Phases captured:
* absorb_updates — the implication batching loop at the top of
  apply_catalog_implications
* inner_total — the whole call into apply_catalog_implications_inner
  (split below)
* inner_item_loop — the `for (catalog_id, implication) in implications`
  loop that walks per-item implications
* inner_cluster_loops — the cluster + cluster_replica command loops
* inner_controller_setup — the post-loop calls into the controllers:
  create_source_collections, create_table_collections,
  initialize_storage_collections, vpc_endpoints, alter_*
* inner_dependency_scan — active-sink/peek/copy cleanup + timeline
  association rebuilds
* inner_finalize — the "no error returns allowed" async block: drops,
  retires, background secret/replication-slot cleanup

Same pattern as `mz_catalog_transact_phase_seconds`: histogram cloned
locally, RAII timers via `start_timer()`. No behavioral change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…cations

Sub-phase split of mz_apply_catalog_implications_seconds at
N=5k/10k/15k. inner_controller_setup (i.e. create_table_collections
+ initialize_storage_collections, etc.) is 84% of inner_total and
carries ~93% of the slope.

CREATE-only inner_controller_setup is 16.32 / 17.13 / 19.70 ms
across scales. That's the single call into
controller.storage.create_collections opening a fresh persist
WriteHandle + SinceHandle for the new table shard, then a
compare_and_downgrade_since on it.

DROP-only inner_finalize is ~3.6 ms — drop_tables -> txn-wal
append. Mostly flat.

Everything else (absorb_updates, item loop, cluster loops,
dependency scan) is microseconds in this workload and will not
matter until we have real user-cluster activity.

Next: split create_table_collections / create_collections into
phases (write-ts grab, advance_upper, open handles,
downgrade-since, install collection state) to attribute the 20 ms
CREATE cost.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add four sub-phase labels to mz_apply_catalog_implications_phase_seconds:
- create_table_write_ts (get_local_write_ts)
- create_table_advance_upper (catalog.advance_upper)
- create_table_storage_create_collections (controller.storage.create_collections)
- create_table_apply_local_write (apply_local_write)

inner_controller_setup is 16+ ms per CREATE at N=15k and carries the
dominant slope inside apply_catalog_implications. This split tells us
which of the four steps is responsible.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sub-phase split of create_table_collections at N=5k/10k/15k. Inside
apply_catalog_implications's inner_controller_setup (16-22 ms per
CREATE), the four named steps break down as:

- storage.create_collections: 8.16 -> 8.98 -> 11.57 ms (slope +0.82 / +2.59)
- get_local_write_ts:         3.27 -> 3.62 -> 3.61 ms (flat ~3.6)
- apply_local_write:          2.88 -> 3.12 -> 3.29 ms (flat ~3)
- catalog.advance_upper:      0.56 -> 0.61 -> 0.71 ms (small slope)

storage.create_collections is both the dominant absolute cost AND the
dominant slope owner (74% of inner_controller_setup slope). Next
target: split storage_collections::create_collections_for_bootstrap
into open_data_handles, compare_and_downgrade_since, and the
collection-install loop. Existing info_span!s mark the boundaries.

Secondary finding: CREATE TABLE makes two get_local_write_ts calls
(one in catalog_transact_inner, one in create_table_collections),
each ~3 ms. Worth a design review of whether the second is required.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…strap

Split storage.create_collections (the slope owner inside CREATE TABLE
controller setup, identified in commit 7491526) into two layers of
sub-phases:

storage_collections.create_collections_for_bootstrap:
 - validate_and_enrich
 - open_persist_client
 - open_data_handles_concurrent (buffer_unordered 50)
 - sort
 - install_collection_states (under collections mutex)
 - synchronize_finalized_shards

storage_controller.create_collections_for_bootstrap:
 - storage_collections_call (inner call to the above)
 - validate_and_enrich
 - open_persist_client
 - open_data_handles_concurrent (the second buffer_unordered, controller's
   own per-collection write handle)
 - register_loop (per-collection for loop with acquire_read_holds)
 - init_source_statistics
 - table_register (persist_table_worker.register batched call)
 - append_shard_mappings
 - run_to_execute

Two new HistogramVecs registered:
 - mz_storage_collections_create_collections_phase_seconds
 - mz_storage_controller_create_collections_phase_seconds

Both keyed on phase label. Will use to attribute the 11.57 ms / +2.59 ms
per +5k slope inside controller.storage.create_collections at N=15k.
… is in Catalog::transact

Phase 7 ran the new mz_storage_collections / mz_storage_controller
create_collections_phase_seconds histograms at N=5k/10k/15k. Findings:

1. storage.create_collections is flat at ~9 ms per CREATE TABLE across
   all three scales. The phase 6 measurement (11.57 ms at N=15k with
   +2.59 ms per +5k slope) was run-to-run variance on a single data
   point, not a real slope.
2. Inside storage_collections::create_collections_for_bootstrap the
   only sub-phase with a visible slope is install_collection_states
   (0.19 -> 0.53 -> 1.54 ms). That's the post-stream loop running under
   the collections mutex; the work is BTreeMap inserts plus channel
   sends. Probably not worth fixing at current N.
3. open_data_handles_concurrent (the buffer_unordered stream that opens
   the SinceHandle + WriteHandle pair) is the biggest absolute cost at
   ~5 ms/CREATE but it's flat. The per-shard persist work doesn't grow
   with the number of already-registered shards.
4. apply_catalog_implications.inner_controller_setup is also flat now
   (9.43 -> 9.44 -> 9.72 ms). Phase 5's slope claim was either fixed
   between then and now, or also variance on a single N=15k point.

Where the slope actually lives now: Catalog::transact, specifically
transact_inner (+2.5 ms/+5k), tx_commit (+1.5 ms/+5k), op_loop
(+0.8 ms/+5k), and the apply_updates family (+1.8 ms/+5k combined).
coord_inner_total is +5.48 ms per +5k tables per call; doubled across
CREATE+DROP that's +11 ms/+5k per DDL, which fully accounts for the
observed create_p50 slope of +9-10 ms/+5k.

Next iteration target: instrument tx_commit and the StateUpdate
appliers inside Catalog::transact.
Adds two HistogramVecs to drill into tx_commit, which phase 7 attributed
+1.51 ms / +5k slope to:

  mz_catalog_commit_transaction_phase_seconds{phase}
    - caa_fence_check
    - caa_encode
    - caa_persist_caa_inner    (the inner compare_and_append_inner wrapper)
    - caa_persist_compare_and_append (the actual persist write_handle.CaA)
    - caa_since_downgrade      (since handle maybe_compare_and_downgrade_since)
    - caa_post_sync            (the sync(next_upper) call after CaA)

  mz_catalog_sync_phase_seconds{phase}
    - listen_fetch     (listen.fetch_next, summed across iterations)
    - apply_updates    (apply_updates, summed across timestamps)
    - consolidate      (maybe_consolidate + final consolidate)

sync_phase_seconds aggregates across all iterations within a single
sync_inner call, so each tx_commit produces one sample per phase, not
one per listen event.

Will pin down whether the +1.51 ms / +5k tx_commit slope lives in the
persist CaA, the post-CaA sync, or the in-memory consolidate.
Phase 8 attributed the +7 ms/+5k slope in tx_commit at N=5k -> N=10k
to mz_catalog_sync_phase_seconds{phase="consolidate"}:

  N=5k:   0.64 ms/call x 6 calls/DDL = 3.82 ms/DDL
  N=10k:  1.87 ms/call x 6 calls/DDL = 11.20 ms/DDL
  N=15k:  2.39 ms/call x 6 calls/DDL = 14.37 ms/DDL

Root cause: sync_inner ended with an unconditional consolidate(), which
does O(N log N) work on the entire snapshot. For the steady-state case
(one timestamp per sync, ~5-10 new entries added to a 15k-entry snapshot)
this ran on every DDL commit, paying the full sort + dedup cost for a
trivial delta.

The doubling-threshold maybe_consolidate (added in MaterializeInc#36233) was supposed
to amortize this, but it never triggered: sync_inner reset
size_at_last_consolidation to None at the top of every call, so the
threshold was always re-baselined against the current snapshot size.
A single DDL never grows the snapshot by 2x, so maybe_consolidate
inside the loop did nothing — and then the unconditional consolidate
at the end paid the full O(N log N) cost.

Two changes:
 1. Drop the size_at_last_consolidation = None reset at the top of
    sync_inner. The doubling threshold is meant to amortize across the
    snapshot's lifetime, not per sync_inner invocation.
 2. Replace the unconditional self.consolidate() at the end with
    self.maybe_consolidate(). Combined with the per-ts maybe_consolidate
    inside the loop, this keeps memory bounded at 2x the last
    consolidated size while making per-call cost amortized O(log N)
    instead of O(N log N).

Verified existing tests still pass:
  test_persist_sync_consolidation_not_quadratic     ok
  test_persist_sync_snapshot_stays_bounded_under_churn  ok

The "stays bounded under churn" test (200 renames of one DB) still passes
because the persistent threshold + per-ts maybe_consolidate fires every
~log(N) steps. The "not quadratic" test still passes because total
consolidations during a 100-ts sync stay well under the test's bound of 10.
Phase 8 split tx_commit into commit_transaction phases + sync phases.
Found mz_catalog_sync_phase_seconds{phase="consolidate"} at:
  N=5k:  3.82 ms/DDL
  N=10k: 11.20 ms/DDL
  N=15k: 14.37 ms/DDL

Phase 9 (post-fix in commit 00d31c5) shows consolidate flat at 0 ms
across all N, tx_commit per-call flat at ~1.13 ms (was 2.92 -> 5.54).
create_p50 at N=15k dropped from 48.93 to 44.51 ms (-4.42 ms);
slope at 10k->15k dropped from +9.11 to +5.81 ms/+5k (35% reduction).

The residual slope has moved entirely to in-memory state-apply paths
(transact_inner outer, op_loop, apply_updates family). Next iteration
target: profile CatalogState::apply_updates for the per-update walk
that still scales with N.
Phase 9 confirmed the catalog `consolidate` slope is gone. The residual
+5.81 ms/+5k slope at 10k→15k is still inside Catalog::transact_inner,
specifically in `apply_updates` and family. We have no visibility into
which sub-step of apply_updates owns that slope.

Add two new histograms:

  * mz_catalog_apply_updates_phase_seconds{phase}
      One observation per apply_updates call, per sub-phase:
        - consolidate_initial (the per-call consolidate_updates)
        - sort_per_group       (sort_updates per timestamp group)
        - apply_updates_inner  (the kind-dispatch loop)
        - cleanup_notices      (drop_optimizer_notices + pack)

  * mz_catalog_apply_update_kind_seconds{kind}
      One observation per StateUpdate inside apply_updates_inner,
      labeled by StateUpdateKind variant (item, schema,
      storage_collection_metadata, etc.). The 1e-5 lower bucket lets us
      see individual sub-microsecond updates so per-kind contributions
      add up correctly across hundreds of updates per DDL.

Strong working hypothesis for the slope: `Arc::make_mut` on
`storage_metadata` (whose inner `collection_metadata` is a non-persistent
`BTreeMap<GlobalId, ShardId>`) does a full O(N) clone every time the
Arc is shared between `preliminary_state` and `state` Cow's — which
happens twice per DDL through the op_loop + final_apply_updates path.
At N=15k that's ~5 ms of pure clone work, matching the observed slope.

The next phase will run the bench and confirm or refute this from the
per-kind data.
Phase 10's per-kind apply_updates instrumentation pointed at
`StateUpdateKind::Item` as the dominant slope owner (+850 us per call
per +10k tables, on 4 calls per DDL). The work that scales with N is
inside `apply_item_update` / `insert_entry` / `drop_item`, which call
`self.get_schema_mut(...)` to walk
`database_by_id.get_mut(...).schemas_by_id.get_mut(...)`.

`schemas_by_id` is `imbl::OrdMap<SchemaId, Schema>`, which is shared
between `preliminary_state` and `state` Cow's in `transact_inner`.
`get_mut` on a shared `imbl::OrdMap` path-copies the affected B-tree
leaf and **clones every value in the leaf**, not just the targeted
one. Schema embedded three non-persistent maps:

  pub items: BTreeMap<String, CatalogItemId>,
  pub functions: BTreeMap<String, CatalogItemId>,
  pub types: BTreeMap<String, CatalogItemId>,

At N=15k the audit_pad schema's `items` has 15k entries. The B-tree
leaf containing the materialize-database schemas almost certainly fits
in one imbl chunk, so any `apply_item_update` (even mutating
audit_meas, not audit_pad) leaf-copies audit_pad's Schema and clones
its 15k-entry BTreeMap. That's the O(N) memcpy+tree-build per call
that the phase 10 per-kind metric attributed to `item`.

Switching items/functions/types to `imbl::OrdMap<String, CatalogItemId>`
makes the per-leaf-clone path O(1) (refcount bump on a persistent
tree root). All call sites use only operations common to both types
(get / insert / remove / contains_key / is_empty / len / values /
iter), so it's a drop-in. The fn pointer signature in
`CatalogState::resolve` is updated to match.

Phase 11 bench follow-up will validate that `item` mean per call
flattens with N.
…ections

Phase 10's per-kind apply_updates instrumentation showed
`StateUpdateKind::StorageCollectionMetadata` had a clean
+200 us/call slope at +5k tables (2 calls per DDL, ~+0.4 ms/DDL).

`StorageMetadata` lives behind `Arc<StorageMetadata>` on `CatalogState`.
The `preliminary_state`/`state` Cow pattern in `Catalog::transact_inner`
shares this Arc, so the first `Arc::make_mut(storage_metadata)` per
independently-owned `CatalogState` deep-clones its fields:

  pub collection_metadata: BTreeMap<GlobalId, ShardId>,
  pub unfinalized_shards: BTreeSet<ShardId>,

With N=15k tables, `collection_metadata` has ~15k entries, and the
`BTreeMap` clone is O(N). At 2 make_mut'd `CatalogState`s per DDL,
that's two full clones per DDL.

Switch both to imbl::OrdMap / imbl::OrdSet so the clone is O(1)
(persistent tree refcount bump). All external callers use only
operations common to both (.get(), .contains(), .iter()). The only
local API divergence: `imbl::OrdSet::insert/remove` return
`Option<T>`, not `bool` — `apply_unfinalized_shard_update` is
adjusted accordingly.

Same reasoning as ad197b0 (Schema.items/functions/types), applied
to the storage-side analogue.
Phase 10 instrumentation (00025cb) split apply_updates into
sub-phases + per-StateUpdateKind, attributing the +1.94 ms/+5k
apply_updates_inner slope at 10k→15k to two non-persistent
collections living behind shared imbl/Arc handles:

  1. Schema.items/functions/types: BTreeMap<String, CatalogItemId>
     — cloned via imbl::OrdMap leaf path-copy in get_schema_mut.
     At N=15k the audit_pad schema's items had 15k entries; every
     apply_item_update clone cost O(N).
  2. StorageMetadata.{collection_metadata, unfinalized_shards}:
     BTreeMap/BTreeSet behind Arc<StorageMetadata>. The
     preliminary_state/state Cow split in transact_inner forces
     Arc::make_mut to deep-clone these on each owned CatalogState.

Phase 11 (ad197b0) and phase 12 (4b6f5d1) swap both to
their imbl persistent counterparts.

Results at N=15k:
 - item kind: 1436 -> 242 us/call (-83%)
 - storage_collection_metadata kind: 567 -> 5.12 us/call (-99%)
 - apply_updates_inner slope (10->15k): +1.94 -> +0.18 ms/DDL
 - create_p50 (15k): 48.93 (phase 7) -> 42.84 (phase 12), -6.09 ms

NOTES.md captures the per-phase tables and writes up the
generalizable pattern: inline value types stored inside an
imbl::OrdMap (or behind shared Arc) silently lose the O(1)
clone property to any non-persistent sub-collection field.
…ue pattern

Read-only audit of every imbl::OrdMap in the catalog/controller hot
paths, looking for the same "outer is persistent, inner is not"
shape that drove phases 11 and 12.

Findings, ranked:

HIGH (same shape, same hot path, drop-in fix):
 - Database.{schemas_by_id, schemas_by_name}: BTreeMap inside
   Cluster Cluster -> wrong, Database value held in
   imbl::OrdMap<DatabaseId, Database>. get_schema_mut is on the
   every-apply_item_update path. Worth landing next.

MEDIUM (workload-dependent, not exercised by the audit_pad bench):
 - Cluster.bound_objects: BTreeSet<CatalogItemId> grows with every
   object bound to that cluster. Invisible in this bench (plain
   tables don't bind to a cluster), but expected to slope under MV /
   index workloads on a single cluster.
 - Cluster.replica_id_by_name_ / replicas_by_id_ / log_indexes:
   small in practice.

LOW (small or workload-specific):
 - Role.{vars.map, membership.map}: per-role counts are small.
 - SourceReferences.references: grows with one source's refs, not N.
 - CatalogEntry.{referenced_by, used_by}: small per entry, but the
   16-entry imbl leaf clone of entry_by_id also re-clones
   CatalogItem (incl. optimized/physical plans for MV/Index/CT) for
   sibling entries; matters under MV scale, not table scale.
 - notices_by_dep_id: value is Vec<Arc<_>>, shallow clone is cheap.

Recommended next step: land the Database BTreeMap -> imbl::OrdMap
swap, then design a real MV/index scale bench before considering
the MEDIUM tier. The existing
mz_catalog_apply_update_kind_seconds{kind} histogram is the
canonical signal for whether each tier is worth fixing.
…istent collections

Phase 11+12 fixed the two non-persistent inner collections that the
audit_pad bench exposed (Schema.items and StorageMetadata.collection_metadata).
This commit applies the same fix to the remaining sites identified
by the read-only sweep in 351ddb9:

  Database.{schemas_by_id, schemas_by_name}: BTreeMap -> imbl::OrdMap
  Cluster.{bound_objects, replica_id_by_name_, replicas_by_id_}:
    BTreeMap/BTreeSet -> imbl::OrdMap/imbl::OrdSet
  RoleMembership.map: BTreeMap -> imbl::OrdMap
  RoleVars.map: BTreeMap -> imbl::OrdMap
  SourceReferences.references: Vec -> imbl::Vector

All of these live inside imbl::OrdMap<K, V> on CatalogState (or
transitively inside such a value type), so the
preliminary_state/state Cow split in Catalog::transact_inner forces
imbl leaf path-copies to deep-clone them. The audit_pad bench
doesn't exercise these paths (plain tables don't touch clusters /
roles / sources), but the same shape that produced the +5 ms/DDL
slopes in phases 11+12 would surface on cluster-heavy / role-heavy
workloads.

Two intentional non-changes:

  * Cluster.log_indexes stays BTreeMap<LogVariant, GlobalId>: tight
    API contract with the compute controller (arranged_logs:
    BTreeMap<...>) and bounded by ~10 log variants.
  * CatalogEntry.{referenced_by, used_by} stay Vec<CatalogItemId>:
    the mz_sql::catalog::CatalogItem trait returns &[CatalogItemId]
    for them, which imbl::Vector can't satisfy (no Deref<[T]>).
    Per-entry counts are small and the dominant per-entry clone
    cost during entry_by_id leaf path-copy is item: CatalogItem,
    not these vectors.

Trait signatures in mz_sql::catalog::{CatalogDatabase, CatalogRole,
CatalogCluster} change to match (&imbl::OrdMap / &imbl::OrdSet).

The per-field "why imbl" comments on Schema and StorageMetadata are
removed and consolidated into a single rule-block doc comment on
CatalogState explaining the pattern, the two slopes that motivated
it, and the rule for new fields. Two intentional holdouts are
called out there as well.
…on tables

The phase 13 sweep (1a8446c) switched five more inline-in-imbl::OrdMap
collections off of BTreeMap/BTreeSet/Vec, but the audit_pad bench
(plain CREATE/DROP TABLE) doesn't exercise the cluster, role, or
source paths — so the expected outcome was "no measurable change,
no regression." That's what we got.

item @ N=15k: 242 -> 238 us/call. storage_collection_metadata
unchanged at ~5 us/call. apply_updates_inner total at N=15k flat
at 1.48 ms/DDL. Run-to-run noise dominates.

The sweep is landmine-prevention for cluster-heavy / role-heavy /
source-heavy workloads where the same leaf-clone pattern would
surface a slope on those DDL kinds. A future scale bench targeting
CREATE INDEX / MV / GRANT is the right way to actually measure
those payoffs; the existing per-kind histogram is the canonical
signal.
Adds mz_storage_collections_prepare_state_phase_seconds{phase} so we can
attribute the prepare_state slope (currently 2.4 ms/call at N=15k user
tables, growing roughly linearly) to its sub-phases.

Sub-phases:
 - insert_add, insert_register, delete: txn mutations.
 - dropped_shard_lookup: self.collections.lock() acquire + the loop
   over dropped_mappings.
 - insert_unfinalized: txn.insert_unfinalized_shards.
 - mark_finalized: self.finalized_shards.lock() + txn.mark_shards_as_finalized.

Companion to mz_catalog_transact_phase_seconds (outer) and
mz_apply_catalog_implications_phase_seconds.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the txns shard upper advances, BackgroundTask::run propagates the
new upper to every txns-backed user table by calling
self.update_write_frontiers(...) with one entry per table. The previous
implementation held self.collections (the single global mutex shared
with every DDL path, including catalog prepare_state) for the full O(N)
walk.

At N=15k user tables that meant one txns-upper tick = ~10+ ms of held
mutex, plus the BackgroundTask staying CPU-bound while it held it.

Phase-level bench (mz_storage_collections_prepare_state_phase_seconds)
attributed the slope to exactly this: dropped_shard_lookup (which only
does self.collections.lock() + an empty for-loop in our CREATE TABLE
workload) went from 1.6 µs/call at N=5k to 599 µs/call at N=10k,
matching the OUTER prepare_state mean (24 -> 625 µs).

Process updates in chunks of 256, releasing the lock and yielding the
task between chunks. The fix mostly helps other phases that compete
with the BackgroundTask for CPU and lock access: at N=10k the
end-to-end create_p50 drops by 11 ms (63.18 -> 52.04), driven by:
 - coord_builtin_table_execute  -3.58 ms/DDL
 - coord_pre_transact           -1.87 ms/DDL
 - tx_commit                    -1.56 ms/DDL
prepare_state itself gets slightly worse (the per-chunk lock dance
costs a few µs and the BackgroundTask gets to re-win the lock more
often), but the BackgroundTask no longer monopolizes either lock or
CPU for the whole walk, and the DDL pipeline catches up faster.

A future architectural fix would store the shared txns upper in one
place on StorageCollectionsImpl and skip the O(N) propagation
entirely; that's larger and touches every reader of write_frontier
on a txns-backed collection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
NOTES writeup for phase 14:
- mz_storage_collections_prepare_state_phase_seconds attributed the
  prepare_state slope to dropped_shard_lookup (self.collections.lock
  acquire), confirming BackgroundTask::update_write_frontiers as the
  contention source.
- The chunked unlock in update_write_frontiers ended up being a CPU
  yield win, not the lock-wait win the source comment suggested:
  prepare_state's own contention got slightly worse (lock barging
  by the BackgroundTask), but other phases that compete with it for
  CPU caught up much faster.
- Net same-run effect at N=10k: create_p50 63.18 -> 52.04 (-11 ms),
  coord_inner_total 37.59 -> 29.52 (-8 ms).
- prepare_state slope is not flat; a future architectural pass needs
  to store the shared txns upper in one field rather than fanning it
  out to every txns-backed collection's write_frontier on every tick.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The BackgroundTask::run txns-upper branch used to walk every
txns-backed user collection on each tick of the txns shard upper and
call update_write_frontiers with N entries. The work was O(N) under
the global `collections` mutex and was the slope owner for
prepare_state on the DDL hot path: every txns advance blocked every
DDL caller behind it.

Phase 14 chunked the lock-hold inside update_write_frontiers but kept
the O(N) work on the tick path; this iteration removes it. The latest
observed txns upper now lives in a single shared field
StorageCollectionsImpl::txns_upper. The txns-upper branch just
publishes to that field (O(1)) and reissues the upper future. Readers
that need the current write frontier of a txns-backed collection
(collections_frontiers, set_read_policies_inner, alter_table_desc) go
through effective_write_frontier(), which consults the shared field
for txns-backed collections; the per-collection write_frontier
remains source of truth for non-txns collections.

Persist still needs the per-collection implied capability (since) to
advance so it can compact, so BackgroundTask::run now runs a 1Hz
periodic sweep that walks txns_shards and calls
update_write_frontiers exactly as the old per-tick path did. The work
per sweep is unchanged from before; the change is the frequency
(1Hz vs the ~40Hz observed under DDL load), and the resulting drop in
how often the lock is contended with prepare_state.
Adds two Prometheus counters to the BackgroundTask so we can measure
how often the txns shard upper actually advances vs how often the
periodic since-downgrade sweep runs.

- mz_storage_collections_txns_upper_advances_total: incremented
  on every observed upper advance (the old O(N) per-tick branch,
  now O(1) after phase 15).
- mz_storage_collections_txns_since_sweeps_total: incremented on
  every periodic sweep that fans the shared upper out to per-
  collection implied capabilities.

The ratio between these two is the multiplier we coalesce on the DDL
hot path: under phase-15 it should be roughly (advances/s) : 1 Hz.
Lets the next investigation iteration distinguish "txns shard genuinely
commits at rate X" from "BackgroundTask was spinning for some other
reason".
Adds the iteration-2 writeup answering the phase-15 follow-up: why
the BackgroundTask's txns-upper branch was firing at ~40 Hz under
DDL load.

Result, measured with the counter added in cdb8484:

- At idle, the txns shard upper advances at 1.0 Hz. This is the
  Coordinator's advance_timelines_interval (default_timestamp_interval
  = 1000 ms) firing a periodic group commit that lands one append
  against the txns shard. Not a bug.

- Under bench DDL load (100 CREATE+DROP back-to-back), the upper
  advances at 48 Hz = 4.02 advances per CREATE+DROP pair. The four
  commits per pair are: register, audit-append (CREATE),
  forget, audit-append (DROP). They are structural to the
  catalog/storage-controller split — audit goes through the
  coord group-commit path, register/forget go through the
  TxnsTableWorker, and they don't share a transaction.

No fix shipped. Phase 15's consumer-side coalescing already
removed the per-advance O(N) work. Killing the 4x producer-side
multiplier would require batching register/forget with the
audit append into a single txns commit across the adapter ↔
storage-controller boundary; that's a larger refactor than the
remaining slope justifies (already +0.42 ms/+1k tables, 10k→25k).

The writeup also flags the boundary observation: both the
consumer-side fanout (fixed) and the producer-side multiplier
(unfixed) are different symptoms of the same shape — layers
above txn-wal treat its commit API as unmetered RPC. A future
iteration could explore handing a single Txn through the catalog
transact path if/when CRDB consensus pressure becomes the
bottleneck.
Phase 15 reported a residual slope of +0.42 ms / +1k tables across
N=10k → 25k from a 100-rep sweep and called it "flat enough". The
500-rep bench with 30 warmup reps discarded shows the slope is
actually +1.296 ms / +1k tables [95% CI +1.187, +1.434] for
create_p50 and +1.108 [+1.028, +1.194] for drop_p50 — real,
3.1× steeper than phase 15 reported, and super-linear (accelerates
from +0.31 ms/+1k between N=10k → 15k to +2.31 ms/+1k between
N=20k → 25k).

Attribution (per-call mean diffed over the measurement window)
puts the slope owner below the catalog-transact layer: persist's
consensus_cas on user_data shards grows from 24 ms to 119 ms per
call, with the same pattern in txns-shard consensus_scan and
catalog-shard consensus_truncate. Two CAS-bound calls hit the
DDL hot path: open_data_handles_concurrent (16 → 24 ms) during
CREATE TABLE, and tx_commit (2.8 → 5.6 ms) for the catalog commit.

This is structural to the persist-on-CRDB layer, not a
one-file fix at the adapter / coordinator / storage-collections
boundary. We deliberately do not ship a fix this iteration. If a
future iteration wants to reduce per-DDL persist cost, the
highest-leverage change is killing the 4× producer-side
multiplier called out in iteration 2 of phase 16 — batch
register / forget / audit into a single txns commit so each DDL
pays one CRDB consensus round trip instead of four.

Bench harness, analyzer, and attribution scripts live in
/home/ubuntu/envd-ddl-investigation/ (bench_phase17.py,
analyze_phase17.py, attribute_phase17.py). Results in
results_phase17/ and metrics_phase17/.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Rerunning the phase-17 N=10k → N=25k bench against `mz-bogo-consensus`
(in-memory consensus backend) shows that ~53% of the +1.296 ms/+1k
create-slope reproduces with bogo (+0.691 ms/+1k [95% CI +0.605,
+0.738]) despite per-CAS latency staying flat (0.93 → 0.96 ms on
`consensus_cas`/`user_data`, growth 1.03×).

Slope splits roughly half / half: ~0.69 ms/+1k is coordinator-side
work that scales with catalog size, dominated by `transact_inner`
(+0.258 ms/+1k) and `prepare_state` (+0.200 ms/+1k); the remaining
~0.6 ms/+1k is CRDB-specific per-CAS latency growth on the
`consensus` table. Phase 17's "structural to persist-on-CRDB" is
half right; the other half is fixable in coordinator code.

Suggested iteration 5: finer breakdown inside `transact_inner` to
localize an O(N) walk, and an additional probe inside
`prepare_state` outside the already-attributed sub-phases.
The "CRDB-side" half of the phase-18 split is not in CRDB. It is the
persist-client CRDB connection pool (default max_size = 50) saturating
under N-shard background CAS load.

EXPLAIN ANALYZE on the actual CAS query against a populated consensus
table shows ~2 ms execution, 2/2 MVCC seek/step, flat with shard size
(2 rows vs 13 689 rows) and table size (224 939 rows at N=10k).

Per-call connpool acquire mean over the phase-17 measurement window:
14 ms @ N=10k -> 97 ms @ N=25k, 91 % of consensus_cas latency at the
high end. A pool=500 rebuild at N=10k drops user_data CAS from 16.4 ms
to 2.0 ms per call. DDL-level create_p50 unchanged at N=10k because the
on-path catalog/txns CAS calls aren't pool-bound at this scale.

No fix shipped — bumping the pool default needs a slope re-run plus an
init-order audit so the dyncfg can take effect without an envd restart.
Phase 19 measured a single point (N=10k, steady state, user_data
CAS mean 16 -> 2 ms at pool=500) and concluded the connection-pool
default bump would flatten the +0.61 ms/+1k 'CRDB-side' half of
the create-slope. Phase 20 re-runs the full phase-17 4-scale sweep
(N=10k -> 25k, 30 warmup + 500 measure reps, ascending pad in a
single optimized envd run) with the cfg.rs default bumped to 500
and a fresh rebuild.

Headline numbers:

  create_p50 slope: phase17 +1.296 ms/+1k [95 % CI +1.187, +1.434]
                    phase20 +1.487 ms/+1k [95 % CI +1.391, +1.624]
                    (CIs do not overlap)
  drop_p50   slope: phase17 +1.108 ms/+1k [95 % CI +1.028, +1.194]
                    phase20 +1.185 ms/+1k [95 % CI +1.117, +1.257]

The pool *did* bind to size=500 at every snapshot, and per-call
connpool_acquire mean dropped from 14.2 -> 2.4 ms at N=10k and
from 96.7 -> 65.2 ms at N=25k, matching phase 19's microbench
direction. But:

- N=25k create_p50 is unchanged (70.39 -> 69.85, within CI).
- The non-pool portion of user_data CAS grew from 16.0 -> 67.9 ms
  at N=25k. Lifting the pool throttle just shifted the queue to
  whatever sits behind the pool (CRDB connection slots, SQL CPU,
  or tokio_postgres scheduling).
- Total user_data CAS at N=25k went from 112.7 -> 133.1 ms, an
  18 % regression at the high end.

Phase 19's 8x CAS speedup is real but measured on idle
steady-state at N=10k; it does not generalize to the DDL-burst
regime with ~1.6 k concurrent CAS/sec.

Decision: defer the default bump. The level win at N=10k–N=20k is
3–10 ms, but N=25k is unchanged and the slope is statistically
worse. cfg.rs default reverted to 50; no code change shipped.

Across iterations 14–20 the DDL slope picture is now: phase 15
erased a +0.6 ms/+1k contributor; phases 18 (transact_inner +
prepare_state, +0.4 ms/+1k combined, bogo-visible) and 20 (pool
plus behind-pool, entangled) are the remaining medium-sized
contributors. No single big-bang fix remains; marginal return is
decreasing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@aljoscha aljoscha force-pushed the envd-ddl-scalability branch from b8d836b to d0a4f4b Compare May 21, 2026 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant