DNM: envd scalability work by aljoscha · Pull Request #36634 · MaterializeInc/materialize

aljoscha · 2026-05-20T15:35:39Z

For running spec sheet

The existing scenarios scale cluster size or envd CPU cores -- nothing measures how adapter/envd latency moves as the catalog itself grows. Add two scenarios under a new `envd_scalability` group that fix the measurement cluster and vary the number of catalog objects. `envd_scalability_tables` puts N empty tables in the catalog -- pure catalog/adapter pressure, no controller load. `envd_scalability_mvs` does N materialized views over a single 1-row base table -- same catalog footprint, plus controller load proportional to N. The MV scenario shards across single-replica pad clusters at 10000 MVs per cluster (so 100k MVs spans 10 clusters), since one cluster can't reasonably host that many dataflows. For each N in {1, 10, 100, 1k, 3k, 5k, 10k, 20k, 30k, 50k, 100k} we run 10 reps each of `CREATE TABLE` (DDL through the coordinator) and `SELECT * FROM <1-row table>` (a simple peek on a fixed 100cc cluster). The catalog is built incrementally across size points, so going from N=k to the next size point only adds (next - k) objects -- otherwise we'd pay an O(sizes * N) build cost. The size list is overridable via `--envd-scalability-sizes` for scaffolding runs. Results land in a third CSV (`*.envd_scalability.csv`) reusing the cluster CSV schema; `mode='envd_scalability'` distinguishes the rows. Test analytics rides on the existing `cluster_spec_sheet_result` table -- no schema change needed. The analyzer plots `time_ms` vs N per (scenario, category, test_name). This is going to be long-running, especially the MV scenario where each create exercises the controller -- expect hours for the full size range. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add two new scenarios -- cluster_object_limits_indexes and cluster_object_limits_mvs -- that find, per cluster size, the maximum number of idle materializations one cluster can keep fresh. The materializations are derived from a one-row, never-updated base table so the only work the cluster has to do is keep advancing each materialization's write_frontier in step with the upstream table. Once the cluster can't keep up, freshness collapses; the driver records the largest N at which `max(local_lag) < 2s` was still achievable, with the unhealthy data point recorded too so the cliff is visible. Staging-only (rejects --target=cloud-production), to avoid burning production resources on long object-limit searches.

…lability default at 50k When a materialization stalls completely (write_frontier never advances past the minimum timestamp), `mz_internal.mz_materialization_lag` reports `now() - 0` = current unix time in ms (~1.78e12). Recorded as-is this crushes every healthy data point to ~0 on the plot. Cap the recorded value at 10x the healthy threshold (= 20 s), preserve the underlying truth via the `healthy` column, and label the plot to make the cap and healthy threshold explicit. Also drop 100_000 from the envd_scalability default size list: 50_000 is a more sensible default ceiling for staging. The full size list is still override-able via --envd-scalability-sizes for ad-hoc runs.

…tion The release-qualification pipeline already runs three cluster-spec-sheet groups (cluster_compute on production, source_ingestion on production, environmentd on staging). Add two more groups -- envd_scalability and cluster_object_limits -- both running against staging, since both push the catalog / cluster to limits we don't want to exercise on production.

The three "envd / cluster" groups in the cluster-spec-sheet were named inconsistently. Settle on the three concept names the cluster-spec-sheet effort uses verbally: environmentd -> envd_qps_scalability (QPS vs envd CPU) envd_scalability -> envd_objects_scalability (latency vs catalog N) cluster_object_limits -> cluster_object_limits (unchanged) Renames apply to: scenario constants, scenario-name string values, group keys in SCENARIO_GROUPS, class names, the run/analyze function names, the --envd-scalability-sizes CLI flag, the result CSV suffix, and the `mode` field written into CSV rows. The pre-existing QPS scenarios keep their individual `*_envd_strong_scaling` names since only the group is renamed. Also updates the release-qualification pipeline step ids/args and the README to match.

…w start When debugging cluster-spec-sheet runs on staging it's hard to tell which environment we're actually talking to and whether the system parameter defaults we expect (lifted via LaunchDarkly or similar) are actually applied. Add a one-shot diagnostic right after target.initialize() that prints mz_environment_id() and SHOWs the limits the test depends on (max_tables, max_materialized_views, max_objects_per_schema, max_clusters, max_credit_consumption_rate, memory_limiter_interval). Best-effort: any probe error is logged and swallowed so a transient failure does not abort the workflow.

psycopg3's execute() requires a LiteralString, so the f-string SHOW query tripped pyright in CI. Compose the statement with psycopg.sql.SQL/Identifier instead, matching the pattern already used in test/orchestratord/mzcompose.py.

A staging run of `envd_objects_scalability_mvs` (release-qualification build 1219) aborted at N≈19800 with: Retryable error: consuming input failed: SSL error: unexpected eof while reading, reconnecting... psycopg.errors.InternalError_: materialized view "materialize.pad_schema.pad_mv_19805" already exists The TLS connection dropped mid-statement; envd had already committed the CREATE but the response was lost. ConnectionHandler.retryable reconnects and replays the same statement, which then fails with "already exists". Use ``CREATE ... IF NOT EXISTS`` for every CREATE issued via _bulk_run so the retry is a no-op. Affects the bulk-creation paths in both envd_objects_scalability scenarios (tables, MVs) and both cluster_object_limits scenarios (indexed views, MVs). Add a docstring on _bulk_run spelling out the idempotency requirement so future CREATEs don't reintroduce the hazard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The 50k scale point pushes a single Envd Objects Scalability run past the 13-hour mark on staging — adapter latency degrades so much by then that each measurement repetition takes several seconds, and the catalog build itself runs at <1/s. 30k is where the interesting signal already lives. Drop 50k from the default list; ad-hoc runs that want it can still pass --envd-objects-scalability-sizes explicitly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The cluster_object_limits N-list defaults to a +1k linear step past N=1000, which is too coarse: a run on staging showed the cliff sits in (1000, 2000] for cluster_object_limits_indexes across every cluster size 100cc..1600cc, and we can't tell from that whether the limit is 1100 or 1900. After the coarse N-walk hits its first unhealthy point, bisect the (last_healthy, first_unhealthy) interval --cluster-object-limits-bisect- steps times (default 4) and probe each midpoint. The bisection step adds or drops objects in place — never rebuilds the catalog — so the cost is only ~bisect_steps extra hydrate-and-probe rounds per cluster size. With the default 4 steps, the cliff narrows to ±~60 objects. Adds: - `remove_objects(target_n)` symmetric to `add_objects(target_n)` on both ClusterObjectLimitsScenario subclasses. Indexes scenario drops via DROP VIEW ... CASCADE (cascades to the default index); MVs scenario drops via DROP MATERIALIZED VIEW. - `--cluster-object-limits-bisect-steps` CLI flag plumbed through to `run_scenario_cluster_object_limits`. - Bisection block in the per-cluster-size loop that calls add+remove (one is a no-op) and records each probe under the same CSV schema, so the existing freshness-lag-vs-N plot just gets denser near the cliff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

If CREATE CLUSTER fails for a cluster_object_limits size — because the target region doesn't expose that replica size, or because allocating the cluster would exceed max_credit_consumption_rate — today the scenario either aborts with a traceback or (when the cluster is created but then can't actually keep up) reports a confusing "unhealthy at N=100" data point. Catch psycopg.errors.DatabaseError around the CREATE CLUSTER, log a clear "size unavailable" line (with the underlying error class + message), and `continue` to the next cluster size. OperationalError is re-raised so that genuine connection failures (which run_query's retry loop has already given up on) aren't silently masked as a size problem. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Standalone Python harness that connects to a running environmentd, pads the catalog with N objects of a chosen type (tables / views / mvs / indexes), and times CREATE / DROP / ALTER / RENAME of various object types at each scale point. Captures trace IDs from emit_trace_id_notice so spans can be fetched from Tempo afterwards. This is intended for profiling iteration, not CI tracking; the canonical scaling numbers continue to live in test/cluster-spec-sheet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Replace CREATE CLUSTER IF NOT EXISTS (unsupported) with drop-then-create in both pad and measurement state setup. - Default --pad-cluster-size and --meas-cluster-size to scale=1,workers=1 (the smallest size accepted by a local envd; the 50cc cloud size only exists in cloud sizing maps). - Add summarize_traces.py: bulk-fetch traces from Tempo and aggregate self-time per span name across a sampled (padding, scale, op) cell. - NOTES.md: log two O(n) suspects identified by source-reading before any traces were captured: 1. Transaction::allocate_oids walks all databases/schemas/roles/ items/introspection_sources on every OID allocation — the comment at the function literally says "If DDL starts slowing down, this is a good place to try and optimize." 2. Coordinator::validate_resource_limits does 5+ full walks of entry_by_id per CREATE (user_tables / user_sources / user_sinks / user_materialized_views / user_connections) — each is a filter over the whole catalog with no per-type bucket. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tables-padding audit at N in {0,500,2000,5000}: all CREATE/DROP/ALTER/ RENAME ops scale ~4 μs/object regardless of op, which says the bottleneck is shared (catalog/coordinator) infrastructure, not op-specific. Top per-N growers from the trace summary: - storage::create_collections: +12.6 ms / 5000 = 2.5 μs/obj (CREATE TABLE only) - snapshot + transaction + consolidate (catalog durable): +7 ms / 5000 = 1.4 μs/obj (all ops) - PersistTableWriteCmd::Append: +1.7-1.9 ms / 5000 (all ops) - apply_catalog_implications_inner: ~1.1 ms appears at high N - apply_updates: ~0.5-0.7 ms growth Adds a third O(N) suspect to NOTES.md: TableTransaction::insert/update both call for_values to scan all rows on every mutation — every catalog durable txn insert walks the whole table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

views: half the slope of tables padding (+8 ms vs +23 ms at N=5000). Tells us the per-N cost stacks two components: - ~half scales with storage-collection count (paid only by ops that create/drop storage collections) - ~half scales with catalog-entry count (paid by all ops) mvs: super-linear, possibly quadratic. With 2000 MVs sharing a common base table, CREATE TABLE jumps from 37 ms (N=0) to 2095 ms (N=2000) — 14x slowdown for 4x N. Trace for the 2095 ms CREATE TABLE attributes 1.65 s of self-time to storage::create_collections, none of which is in any instrumented sub-span. Source-reading points at update_read_capabilities_inner walking a MutableAntichain that accumulates one entry per MV holding a read hold on the shared dependency, but we need targeted instrumentation or a CPU profile to confirm. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add #[instrument] / info_span! wrappers inside storage_collections::create_collections_for_bootstrap to break down the self-time that currently lands in storage::create_collections without any sub-span: - ccfb::open_persist_client: persist client open - ccfb::open_data_handles_concurrent: buffer_unordered open_data_handles - ccfb::install_collection_states: main per-collection install loop - ccfb::synchronize_finalized_shards: post-loop finalize sync - install_collection_dependency_read_holds_inner - update_read_capabilities_inner - acquire_read_holds - append_shard_mappings (storage-controller side) Lets us pinpoint which sub-phase is responsible for the super-linear blow-up observed when N materialized views share a base table (CREATE TABLE goes from 37 ms at N=0 to 2 s at N=2000 mvs). Also bump Tempo's max_traces_per_user to 200k. Default 10k was rejecting trace exports from envd during these audits, dropping the very traces we need to analyse. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adding sub-spans inside open_data_handles (odh::upgrade_version, odh::open_critical_handle, odh::open_write_handle, odh::fetch_recent_upper) shows that the super-linear blow-up under MV-shared-dependency padding is entirely in persist's per-shard state initialization path: at N=1000 MVs, for one CREATE TABLE: ccfb::open_data_handles_concurrent: 119 ms odh::upgrade_version: 41 ms (persist make_machine + cas) odh::open_critical_handle: 33 ms (persist make_machine) odh::fetch_recent_upper: 7 ms odh::open_write_handle: <1 ms Both upgrade_version and open_critical_handle bottom out in StateCache::get -> Applier::new -> maybe_init_shard, which hits CockroachDB via the persist consensus table. Per-shard queries are indexed by primary key, but observed cost grows with the total number of existing shards in CRDB. NOTES.md records the chain of suspicion and concrete next steps for verification (samply, raw CRDB timing, contention isolation). This is persist+CRDB territory, separable from the catalog-side O(N) fixes already proposed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Coordinator::validate_resource_limits runs before every DDL and was making 5+ full walks of state.entry_by_id to count user items by type (tables, sources, sinks, materialized views, secrets, connections by sub-kind). At N catalog items each walk is O(N), so a DDL with N=5000 items burned ~25k OrdMap iterations of dead work. Add three derived fields to CatalogState, maintained alongside entry_by_id in insert_entry / drop_item: user_item_counts: imbl::OrdMap<CatalogItemType, usize> user_source_shard_count: usize (sum of user_controllable_persist_shard_count) user_connection_counts: imbl::OrdMap<UserConnectionKind, usize> UserConnectionKind mirrors the ConnectionDetails variants that the limits care about (Kafka/Postgres/MySql/SqlServer/AwsPrivatelink) with an Other bucket for the rest. All fields use imbl::OrdMap to keep CatalogState's cheap Cow-clone semantics. Bootstrap, normal apply, and the retract+addition update path all go through insert_entry / drop_item, so the buckets stay in sync without extra wiring. Existing iterator methods (user_tables(), ...) are left in place for other callers. Rewrites the connection sub-type loop and four .count() calls in validate_resource_limits to use the new O(log N) getters. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`Transaction::allocate_oids` previously walked every row in the five OID-bearing tables (databases, schemas, roles, items, introspection_sources) on every single OID allocation, paying O(N) in the size of the catalog. At N=5000 items this is ~50 μs of pure compute per DDL — visible in DDL latency traces, with the original code's own comment flagging this exact spot as the place to optimize once DDL starts slowing down. Cache the set of OIDs from the *initial* snapshot in a new `Transaction::initial_oids: BTreeSet<u32>`, populated once in `Transaction::new`. On each `allocate_oids` call, fold in the pending changes for those five tables by walking only their `pending` BTreeMaps (small in practice), computing per-key OID adds and removes relative to the initial value. The integer scan loop then checks membership in O(log N) per probe instead of O(N) per call. `initial_oids` is not serialized in `current_snapshot`; it is rebuilt when a transaction is restored via `Transaction::new`, since the snapshot already reflects the merged initial+pending state at the restore point. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`TableTransaction::insert` previously called `for_values` for every insertion, walking every initial+pending row of the table to (a) check for a duplicate key and (b) run the optional uniqueness predicate. At N existing rows this made each insert O(N), and was responsible for a visible slice of the per-object cost on the DDL hot path observed under envd-ddl-scalability. Replace the duplicate-key scan with `self.get(&k).is_some()`, which consolidates `initial`+`pending` for one key in O(log N + P_k). When no uniqueness predicate is registered on the table (the case for the majority of TableTransaction instances — 14 of the 22 constructions use `TableTransaction::new`), skip the for_values walk entirely. When one *is* registered, fall back to the existing full scan, with a code comment pointing at the eventual fix (a uniqueness-key extractor + side index). Behaviour is preserved: the same inputs produce the same `DurableCatalogError::DuplicateKey` / `UniquenessViolation` errors. The only externally visible nuance is that when an insert would have hit both conditions simultaneously, we now consistently surface `DuplicateKey` first — the old code's precedence was iteration-order dependent. The `soft_assert_no_log!(self.verify().is_ok())` at the end of `insert` is itself O(N) (and O(N^2) when the value type doesn't implement `HAS_UNIQUE_NAME`), but it's gated behind `soft_assertions_enabled()` and so is debug/test only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@n

Tables-padding: catalog-side O(N) slope eliminated. CREATE TABLE / DROP TABLE / ALTER / RENAME / CREATE VIEW / DROP VIEW all show ~0 ms Delta from N=0 to N=5000 (previously +17-23 ms each). MVs-padding: catalog-side cost reduced ~2-3x (CREATE TABLE @n=2000: 2095 -> 851 ms) but super-linear remainder is intact, matching the prediction that persist's per-shard init path dominates when many storage collections share dependencies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The first post-fix run compared a cold (freshly --reset) envd against a warm pre-fix run, which made the N=0 baseline look 30 ms higher and the slope look flat. Re-ran on a properly-warm envd: pre/post slopes are essentially identical (~+20 ms from N=0 to N=5000 across all ops). The three fixes were correct but their combined contribution to the per-DDL slope is below measurement noise. So the named O(N) loops (validate_resource_limits, allocate_oids, TableTransaction::insert) are NOT the dominant source of the ~4 μs/ object slope. NOTES.md now lists the remaining suspects we hadn't pinpointed (catalog durable snapshot, compare_and_append on the catalog shard, apply_catalog_implications_inner). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fresh tables-padding audit on warm post-fix envd (N=0, N=5000, reps=8) shows the dominant per-DDL O(N) cost is in catalog durable transaction machinery, not in the named loops the three earlier fixes targeted: span N=0 N=5000 Δ snapshot (catalog durable) 509μs 4.1ms +3.6ms transaction (catalog durable) 426μs 3.0ms +2.6ms consolidate (catalog durable) 449μs 2.1ms +1.7ms group_commit_apply::append_fut 5.0ms 7.6ms +2.6ms apply_catalog_implications_inner <0.5ms 1.2ms +1.2ms apply_updates 536μs 1.4ms +0.9ms snapshot + transaction + consolidate together account for ~1.5 μs/object of the slope. Code locations: with_snapshot src/catalog/src/durable/persist.rs:766 Transaction::new src/catalog/src/durable/transaction.rs:128 consolidate src/catalog/src/durable/persist.rs:706 with_snapshot rebuilds a Snapshot { databases, schemas, items, ... } of all BTreeMaps from scratch per transaction; Transaction::new then walks every row again to do proto→Rust conversion into TableTransaction.initial. Three full O(N) walks per DDL. This is exactly what the proposed DurableCatalogData (Arc<imbl::OrdMap> per table) + per-txn overlay design eliminates.

Spec for making single-statement DDL latency flat in catalog size, by replacing per-transaction Snapshot materialization with shared, indexed DurableCatalogData (Arc<imbl::OrdMap> per table) and a per-transaction overlay. Includes audit findings from the envd-ddl-scalability branch that motivate each part of the design, and a 7-step rollout plan with per-step regression signals on the audit harness.

Step 1 of the O(delta) catalog transaction design (`doc/developer/design/20260515_ddl_catalog_o_delta.md`). Today every single-statement DDL re-materialises the full catalog twice per transaction: `with_snapshot` walks the consolidated trace into a fresh proto-typed `Snapshot { databases: BTreeMap, ... }`, and `Transaction::new` walks every row of that snapshot again to do a proto->Rust conversion into per-table `BTreeMap`s on the `TableTransaction`. With N=5000 catalog items the `snapshot` and `transaction` durable spans together add ~7 ms per DDL. Introduce `DurableCatalogData`: one `Arc<imbl::OrdMap<K, V>>` per durable-catalog table, holding Rust-typed values. It is maintained incrementally by `CatalogStateInner::apply_update` — each persisted state update inserts or removes from the matching map through `Arc::make_mut`, so transactions holding older `Arc`s observe a frozen view. Rewrite `TableTransaction` to overlay-style: `base: Arc<imbl::OrdMap<K, V>>` plus the existing `pending: BTreeMap<K, Vec<TransactionUpdate<V>>>`. Reads probe `pending` then fall through to `base`; writes only touch `pending`. Constructing one is now an `Arc::clone`, no proto conversion. `PersistCatalogState::transaction()` no longer calls `with_snapshot`. It clones the shared `DurableCatalogData` (one `Arc::clone` per field) and hands it to a new `Transaction::new_from_durable_data`. The old `Transaction::new(_, Snapshot, _)` remains for `transaction_from_snapshot` and external callers; it converts the proto-typed Snapshot to a `DurableCatalogData` and delegates to the new constructor, paying the O(N) proto->Rust cost only for those paths. The trace `PersistHandle::snapshot: Vec<...>` is kept as-is — other callers (`with_trace`, `get_next_id`, `Trace::from_snapshot`, `persist_snapshot`) still depend on it. `DurableCatalogData` is maintained alongside, not in place of, the trace. Indexes, resource counts, shared OID sets, and the transaction ownership rework are subsequent steps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Step 2 of the O(delta) catalog transaction design (`doc/developer/design/20260515_ddl_catalog_o_delta.md`). Add `DurableCatalogIndexes` to `DurableCatalogData`: - `database_by_name`, `schema_by_parent_name`, `role_by_name`, `cluster_by_name`, `replica_by_cluster_and_name`, `network_policy_by_name`: one `Arc<imbl::OrdMap<Name, Key>>` per table that has a name-based uniqueness predicate. - `item_by_namespace: Arc<imbl::OrdMap<(SchemaId, String), imbl::Vector<(ItemKey, CatalogItemType)>>>` — the items table's uniqueness predicate is composite (a Type item co-exists with a non-conflicting Func item at the same `(schema, name)`), so the bucket is a small list of candidates, walked pairwise. - `oid_owner: Arc<imbl::OrdMap<u32, OidOwner>>` covers all five OID-bearing tables (databases, schemas, roles, items, intro sources). Not wired in yet — step 4 replaces the per-txn `initial_oids` walk with this. The indexes are maintained in `CatalogStateInner::apply_update`, in lock-step with the durable maps, via `Arc::make_mut`. Plumb a per-`TableTransaction` `index_probe: Option<Box<dyn Fn(&V) -> Vec<K>>>` so `insert` and `verify_keys` can replace the O(N) `for_values` walk over `base` with an O(log N + collisions) index probe plus a small walk over `pending`. The fallback nested-loop scan stays for tables with a uniqueness predicate but no index. Wire all seven indexed tables in `Transaction::new_from_durable_data` via `Arc::clone`s of the index maps. `snapshot_to_durable_data` builds the indexes once from the converted tables — paid by callers that go through proto-`Snapshot` (`transaction_from_snapshot`, tests), not the hot `transaction()` path. Two test items (`test_persist_items`, `test_persist_ddl_detection_with_batch_allocated_ids`) hard-coded OIDs in the 20_000 range that collide with bootstrap-assigned OIDs (materialize db, public schema). The new strict `oid_owner` insert assertion correctly rejects them, so bump test OIDs to 30_000+ rather than weaken the invariant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Step 4 of the O(delta) catalog transaction design. `Transaction::new_from_durable_data` was still building a per-txn `initial_oids: BTreeSet<u32>` by walking every value in the five OID-bearing tables (databases, schemas, roles, items, intro sources). That's residual O(N) work in the otherwise O(delta) hot path. Replace `initial_oids: BTreeSet<u32>` with `initial_oid_owner: Arc<imbl::OrdMap<u32, OidOwner>>`, populated by `Arc::clone`ing the shared `oid_owner` index that step 2 added. The probe in `allocate_oids::is_allocated` becomes `initial_oid_owner.contains_key` — O(log N) point lookup instead of O(1) on a freshly-built set, but the build cost drops from O(N) to O(1) per transaction. `pending_added` / `pending_removed` still cover this transaction's in-flight allocations / removals; only the initial-state view changes representation. No behaviour change for callers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… stmts Step 5 of the O(delta) catalog transaction design. Two coupled changes. (A) Drop the `'a` borrow on `Transaction`. `Transaction<'a>` held `&'a mut dyn DurableCatalogState` only to: - read `is_bootstrap_complete()` in two debug asserts, - read `is_savepoint()` for `Transaction::is_savepoint()`, - hand the reference back out of `into_parts` so `commit_internal` could call `commit_transaction`. Capture the two booleans at construction. Make `commit` and `commit_internal` take `storage: &mut dyn DurableCatalogState` as an argument. `Transaction` becomes owned (no lifetime), so it can be moved into `TransactionOps::DDL` and held across `await` points without a borrow on the catalog storage. `Transaction::new` / `new_from_durable_data` now take the two booleans directly. The new `DurableCatalogState::transaction_from_durable_data(data)` method mirrors `transaction_from_snapshot(snapshot)` but skips the O(N) proto→Rust conversion. `StorageTxn for Transaction` (no lifetime) follows trivially. `Transaction::current_durable_data(&self)` is the new proto-conversion-free counterpart of `current_snapshot()`. Each `TableTransaction` folds its `pending` into its `base` (`Arc::clone` when pending is empty), and the indexes are rebuilt with `DurableCatalogIndexes::from_tables`. (B) Carry `DurableCatalogData` in `TransactionOps::DDL`. The dry-run loop in `transact_incremental_dry_run` previously checkpointed state as a proto `Snapshot` and called `transaction_from_snapshot(prev_snapshot)` on the next statement — an O(N) walk per statement of a multi-statement DDL transaction. Replace `TransactionOps::DDL { snapshot: Option<Snapshot> }` with `durable_data: Option<DurableCatalogData>`, and route through `transaction_from_durable_data` / `current_durable_data`. Each statement is now O(|pending|) instead of O(catalog). The proto `Snapshot` type and the trait methods that use it remain; they're public surface used by `catalog/tests/*` and any external consumer of the `DurableCatalogState` trait. Mechanical sweep through `src/adapter/src/catalog/{open,migrate, open/builtin_schema_migration}.rs`, `src/adapter/src/catalog/transact.rs`, `src/adapter/src/coord/{ddl,command_handler,sequencer/inner}.rs` to drop the `'_` lifetime from `Transaction<'_>` everywhere and to thread `&mut **storage` (or `&mut *state` in catalog tests) into `commit` / `commit_internal` calls. `DurableCatalogState::allocate_id` was a default method that called `txn.commit_internal(commit_ts)` against `&mut self`, which can't be unsized to `&mut dyn DurableCatalogState` without making the method non-object-safe. Rewrite the body to use `into_parts` + `commit_transaction` directly. `allocate_id` only writes a single `id_allocator` row, so skipping `commit_internal`'s in-memory consolidation pass is a no-op. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ications Step 6 of the O(delta) catalog transaction design. The "post-fix high-N trace ranking" in test/envd-ddl-scalability/NOTES.md identified `apply_catalog_implications_inner` as a ~+1.2 ms grower at N=5000, shared by every DDL op type. The culprit is here: The per-timeline loop in `gather_timeline_associations` walked every read hold in `read_holds.storage_ids()` / `.compute_ids()` and asked whether each was a member of the (typically tiny) drop set — effectively O(N) per timeline. With ~5000 storage holds in the default `EpochMilliseconds` timeline, every DDL paid this scan, regardless of whether it dropped anything. Two follow-on costs compounded it: - `compute_gids_to_drop` was a `Vec`, so `.contains((id, gid))` was O(M) per element — O(N·M) inside the loop. - The emptiness check then rebuilt `read_holds.id_bundle()` and called `.difference(&id_bundle).is_empty()` — another full O(N) walk per timeline. Fix: probe the small drop sets into the BTreeMap-backed `storage_holds` / `compute_holds`, making each scan O(|drop_set| · log N). Convert `compute_gids_to_drop` to a `BTreeSet`. Replace the `difference` emptiness check with a length comparison: id_bundle is built from elements that exist in read_holds, so `|read_holds \ id_bundle| == 0` iff sizes match — O(1) given map sizes. Whole-cluster drops are rare and not naturally indexed; keep that case's linear scan but gate it on `clusters_to_drop_set.is_empty()`. `apply_updates` was audited end-to-end; no clear single O(N) loop in the per-update handlers — the remaining ~+0.9 ms in that span is likely tracing overhead and small allocation pressure across the debug-instrumented apply path, not a fixable hot loop. Left for follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…cute is now flat Validation bench at N=5k/10k/15k after 5d2d138 (remove O(n) table advancement loop from group_commit). create_p50 dropped by 5.6 / 9.5 / 14.3 ms across the three scales; per-+5k slope reduction is 38% / 29%. coord_builtin_table_execute itself is now essentially flat: mean per call goes 3.80 -> 3.91 -> 4.59 ms (was 8.26 -> 12.40 -> 16.85 pre-fix). Slope per +5k inside the timer drops from +4.14 / +4.45 ms to +0.11 / +0.68 ms. The remaining ~+11 ms-per-+5k headline slope is now spread across transact_inner, tx_commit, and apply_catalog_implications. Largest single absolute cost per DDL is apply_catalog_implications at ~11.6 ms (flat). That's the next investigation target: a sub-phase split to attribute where the constant 25%-of-DDL cost lives. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds a new HistogramVec `mz_apply_catalog_implications_phase_seconds` labeled by phase. apply_catalog_implications had only a single outer timer (~11.6 ms per call regardless of N) so we could not tell what fraction of that constant cost lives in each region. Phases captured: * absorb_updates — the implication batching loop at the top of apply_catalog_implications * inner_total — the whole call into apply_catalog_implications_inner (split below) * inner_item_loop — the `for (catalog_id, implication) in implications` loop that walks per-item implications * inner_cluster_loops — the cluster + cluster_replica command loops * inner_controller_setup — the post-loop calls into the controllers: create_source_collections, create_table_collections, initialize_storage_collections, vpc_endpoints, alter_* * inner_dependency_scan — active-sink/peek/copy cleanup + timeline association rebuilds * inner_finalize — the "no error returns allowed" async block: drops, retires, background secret/replication-slot cleanup Same pattern as `mz_catalog_transact_phase_seconds`: histogram cloned locally, RAII timers via `start_timer()`. No behavioral change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…cations Sub-phase split of mz_apply_catalog_implications_seconds at N=5k/10k/15k. inner_controller_setup (i.e. create_table_collections + initialize_storage_collections, etc.) is 84% of inner_total and carries ~93% of the slope. CREATE-only inner_controller_setup is 16.32 / 17.13 / 19.70 ms across scales. That's the single call into controller.storage.create_collections opening a fresh persist WriteHandle + SinceHandle for the new table shard, then a compare_and_downgrade_since on it. DROP-only inner_finalize is ~3.6 ms — drop_tables -> txn-wal append. Mostly flat. Everything else (absorb_updates, item loop, cluster loops, dependency scan) is microseconds in this workload and will not matter until we have real user-cluster activity. Next: split create_table_collections / create_collections into phases (write-ts grab, advance_upper, open handles, downgrade-since, install collection state) to attribute the 20 ms CREATE cost. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Add four sub-phase labels to mz_apply_catalog_implications_phase_seconds: - create_table_write_ts (get_local_write_ts) - create_table_advance_upper (catalog.advance_upper) - create_table_storage_create_collections (controller.storage.create_collections) - create_table_apply_local_write (apply_local_write) inner_controller_setup is 16+ ms per CREATE at N=15k and carries the dominant slope inside apply_catalog_implications. This split tells us which of the four steps is responsible. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Sub-phase split of create_table_collections at N=5k/10k/15k. Inside apply_catalog_implications's inner_controller_setup (16-22 ms per CREATE), the four named steps break down as: - storage.create_collections: 8.16 -> 8.98 -> 11.57 ms (slope +0.82 / +2.59) - get_local_write_ts: 3.27 -> 3.62 -> 3.61 ms (flat ~3.6) - apply_local_write: 2.88 -> 3.12 -> 3.29 ms (flat ~3) - catalog.advance_upper: 0.56 -> 0.61 -> 0.71 ms (small slope) storage.create_collections is both the dominant absolute cost AND the dominant slope owner (74% of inner_controller_setup slope). Next target: split storage_collections::create_collections_for_bootstrap into open_data_handles, compare_and_downgrade_since, and the collection-install loop. Existing info_span!s mark the boundaries. Secondary finding: CREATE TABLE makes two get_local_write_ts calls (one in catalog_transact_inner, one in create_table_collections), each ~3 ms. Worth a design review of whether the second is required. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…strap Split storage.create_collections (the slope owner inside CREATE TABLE controller setup, identified in commit 7491526) into two layers of sub-phases: storage_collections.create_collections_for_bootstrap: - validate_and_enrich - open_persist_client - open_data_handles_concurrent (buffer_unordered 50) - sort - install_collection_states (under collections mutex) - synchronize_finalized_shards storage_controller.create_collections_for_bootstrap: - storage_collections_call (inner call to the above) - validate_and_enrich - open_persist_client - open_data_handles_concurrent (the second buffer_unordered, controller's own per-collection write handle) - register_loop (per-collection for loop with acquire_read_holds) - init_source_statistics - table_register (persist_table_worker.register batched call) - append_shard_mappings - run_to_execute Two new HistogramVecs registered: - mz_storage_collections_create_collections_phase_seconds - mz_storage_controller_create_collections_phase_seconds Both keyed on phase label. Will use to attribute the 11.57 ms / +2.59 ms per +5k slope inside controller.storage.create_collections at N=15k.

… is in Catalog::transact Phase 7 ran the new mz_storage_collections / mz_storage_controller create_collections_phase_seconds histograms at N=5k/10k/15k. Findings: 1. storage.create_collections is flat at ~9 ms per CREATE TABLE across all three scales. The phase 6 measurement (11.57 ms at N=15k with +2.59 ms per +5k slope) was run-to-run variance on a single data point, not a real slope. 2. Inside storage_collections::create_collections_for_bootstrap the only sub-phase with a visible slope is install_collection_states (0.19 -> 0.53 -> 1.54 ms). That's the post-stream loop running under the collections mutex; the work is BTreeMap inserts plus channel sends. Probably not worth fixing at current N. 3. open_data_handles_concurrent (the buffer_unordered stream that opens the SinceHandle + WriteHandle pair) is the biggest absolute cost at ~5 ms/CREATE but it's flat. The per-shard persist work doesn't grow with the number of already-registered shards. 4. apply_catalog_implications.inner_controller_setup is also flat now (9.43 -> 9.44 -> 9.72 ms). Phase 5's slope claim was either fixed between then and now, or also variance on a single N=15k point. Where the slope actually lives now: Catalog::transact, specifically transact_inner (+2.5 ms/+5k), tx_commit (+1.5 ms/+5k), op_loop (+0.8 ms/+5k), and the apply_updates family (+1.8 ms/+5k combined). coord_inner_total is +5.48 ms per +5k tables per call; doubled across CREATE+DROP that's +11 ms/+5k per DDL, which fully accounts for the observed create_p50 slope of +9-10 ms/+5k. Next iteration target: instrument tx_commit and the StateUpdate appliers inside Catalog::transact.

Adds two HistogramVecs to drill into tx_commit, which phase 7 attributed +1.51 ms / +5k slope to: mz_catalog_commit_transaction_phase_seconds{phase} - caa_fence_check - caa_encode - caa_persist_caa_inner (the inner compare_and_append_inner wrapper) - caa_persist_compare_and_append (the actual persist write_handle.CaA) - caa_since_downgrade (since handle maybe_compare_and_downgrade_since) - caa_post_sync (the sync(next_upper) call after CaA) mz_catalog_sync_phase_seconds{phase} - listen_fetch (listen.fetch_next, summed across iterations) - apply_updates (apply_updates, summed across timestamps) - consolidate (maybe_consolidate + final consolidate) sync_phase_seconds aggregates across all iterations within a single sync_inner call, so each tx_commit produces one sample per phase, not one per listen event. Will pin down whether the +1.51 ms / +5k tx_commit slope lives in the persist CaA, the post-CaA sync, or the in-memory consolidate.

Phase 8 attributed the +7 ms/+5k slope in tx_commit at N=5k -> N=10k to mz_catalog_sync_phase_seconds{phase="consolidate"}: N=5k: 0.64 ms/call x 6 calls/DDL = 3.82 ms/DDL N=10k: 1.87 ms/call x 6 calls/DDL = 11.20 ms/DDL N=15k: 2.39 ms/call x 6 calls/DDL = 14.37 ms/DDL Root cause: sync_inner ended with an unconditional consolidate(), which does O(N log N) work on the entire snapshot. For the steady-state case (one timestamp per sync, ~5-10 new entries added to a 15k-entry snapshot) this ran on every DDL commit, paying the full sort + dedup cost for a trivial delta. The doubling-threshold maybe_consolidate (added in MaterializeInc#36233) was supposed to amortize this, but it never triggered: sync_inner reset size_at_last_consolidation to None at the top of every call, so the threshold was always re-baselined against the current snapshot size. A single DDL never grows the snapshot by 2x, so maybe_consolidate inside the loop did nothing — and then the unconditional consolidate at the end paid the full O(N log N) cost. Two changes: 1. Drop the size_at_last_consolidation = None reset at the top of sync_inner. The doubling threshold is meant to amortize across the snapshot's lifetime, not per sync_inner invocation. 2. Replace the unconditional self.consolidate() at the end with self.maybe_consolidate(). Combined with the per-ts maybe_consolidate inside the loop, this keeps memory bounded at 2x the last consolidated size while making per-call cost amortized O(log N) instead of O(N log N). Verified existing tests still pass: test_persist_sync_consolidation_not_quadratic ok test_persist_sync_snapshot_stays_bounded_under_churn ok The "stays bounded under churn" test (200 renames of one DB) still passes because the persistent threshold + per-ts maybe_consolidate fires every ~log(N) steps. The "not quadratic" test still passes because total consolidations during a 100-ts sync stay well under the test's bound of 10.

Phase 8 split tx_commit into commit_transaction phases + sync phases. Found mz_catalog_sync_phase_seconds{phase="consolidate"} at: N=5k: 3.82 ms/DDL N=10k: 11.20 ms/DDL N=15k: 14.37 ms/DDL Phase 9 (post-fix in commit 00d31c5) shows consolidate flat at 0 ms across all N, tx_commit per-call flat at ~1.13 ms (was 2.92 -> 5.54). create_p50 at N=15k dropped from 48.93 to 44.51 ms (-4.42 ms); slope at 10k->15k dropped from +9.11 to +5.81 ms/+5k (35% reduction). The residual slope has moved entirely to in-memory state-apply paths (transact_inner outer, op_loop, apply_updates family). Next iteration target: profile CatalogState::apply_updates for the per-update walk that still scales with N.

Phase 9 confirmed the catalog `consolidate` slope is gone. The residual +5.81 ms/+5k slope at 10k→15k is still inside Catalog::transact_inner, specifically in `apply_updates` and family. We have no visibility into which sub-step of apply_updates owns that slope. Add two new histograms: * mz_catalog_apply_updates_phase_seconds{phase} One observation per apply_updates call, per sub-phase: - consolidate_initial (the per-call consolidate_updates) - sort_per_group (sort_updates per timestamp group) - apply_updates_inner (the kind-dispatch loop) - cleanup_notices (drop_optimizer_notices + pack) * mz_catalog_apply_update_kind_seconds{kind} One observation per StateUpdate inside apply_updates_inner, labeled by StateUpdateKind variant (item, schema, storage_collection_metadata, etc.). The 1e-5 lower bucket lets us see individual sub-microsecond updates so per-kind contributions add up correctly across hundreds of updates per DDL. Strong working hypothesis for the slope: `Arc::make_mut` on `storage_metadata` (whose inner `collection_metadata` is a non-persistent `BTreeMap<GlobalId, ShardId>`) does a full O(N) clone every time the Arc is shared between `preliminary_state` and `state` Cow's — which happens twice per DDL through the op_loop + final_apply_updates path. At N=15k that's ~5 ms of pure clone work, matching the observed slope. The next phase will run the bench and confirm or refute this from the per-kind data.

Phase 10's per-kind apply_updates instrumentation pointed at `StateUpdateKind::Item` as the dominant slope owner (+850 us per call per +10k tables, on 4 calls per DDL). The work that scales with N is inside `apply_item_update` / `insert_entry` / `drop_item`, which call `self.get_schema_mut(...)` to walk `database_by_id.get_mut(...).schemas_by_id.get_mut(...)`. `schemas_by_id` is `imbl::OrdMap<SchemaId, Schema>`, which is shared between `preliminary_state` and `state` Cow's in `transact_inner`. `get_mut` on a shared `imbl::OrdMap` path-copies the affected B-tree leaf and **clones every value in the leaf**, not just the targeted one. Schema embedded three non-persistent maps: pub items: BTreeMap<String, CatalogItemId>, pub functions: BTreeMap<String, CatalogItemId>, pub types: BTreeMap<String, CatalogItemId>, At N=15k the audit_pad schema's `items` has 15k entries. The B-tree leaf containing the materialize-database schemas almost certainly fits in one imbl chunk, so any `apply_item_update` (even mutating audit_meas, not audit_pad) leaf-copies audit_pad's Schema and clones its 15k-entry BTreeMap. That's the O(N) memcpy+tree-build per call that the phase 10 per-kind metric attributed to `item`. Switching items/functions/types to `imbl::OrdMap<String, CatalogItemId>` makes the per-leaf-clone path O(1) (refcount bump on a persistent tree root). All call sites use only operations common to both types (get / insert / remove / contains_key / is_empty / len / values / iter), so it's a drop-in. The fn pointer signature in `CatalogState::resolve` is updated to match. Phase 11 bench follow-up will validate that `item` mean per call flattens with N.

…ections Phase 10's per-kind apply_updates instrumentation showed `StateUpdateKind::StorageCollectionMetadata` had a clean +200 us/call slope at +5k tables (2 calls per DDL, ~+0.4 ms/DDL). `StorageMetadata` lives behind `Arc<StorageMetadata>` on `CatalogState`. The `preliminary_state`/`state` Cow pattern in `Catalog::transact_inner` shares this Arc, so the first `Arc::make_mut(storage_metadata)` per independently-owned `CatalogState` deep-clones its fields: pub collection_metadata: BTreeMap<GlobalId, ShardId>, pub unfinalized_shards: BTreeSet<ShardId>, With N=15k tables, `collection_metadata` has ~15k entries, and the `BTreeMap` clone is O(N). At 2 make_mut'd `CatalogState`s per DDL, that's two full clones per DDL. Switch both to imbl::OrdMap / imbl::OrdSet so the clone is O(1) (persistent tree refcount bump). All external callers use only operations common to both (.get(), .contains(), .iter()). The only local API divergence: `imbl::OrdSet::insert/remove` return `Option<T>`, not `bool` — `apply_unfinalized_shard_update` is adjusted accordingly. Same reasoning as ad197b0 (Schema.items/functions/types), applied to the storage-side analogue.

Phase 10 instrumentation (00025cb) split apply_updates into sub-phases + per-StateUpdateKind, attributing the +1.94 ms/+5k apply_updates_inner slope at 10k→15k to two non-persistent collections living behind shared imbl/Arc handles: 1. Schema.items/functions/types: BTreeMap<String, CatalogItemId> — cloned via imbl::OrdMap leaf path-copy in get_schema_mut. At N=15k the audit_pad schema's items had 15k entries; every apply_item_update clone cost O(N). 2. StorageMetadata.{collection_metadata, unfinalized_shards}: BTreeMap/BTreeSet behind Arc<StorageMetadata>. The preliminary_state/state Cow split in transact_inner forces Arc::make_mut to deep-clone these on each owned CatalogState. Phase 11 (ad197b0) and phase 12 (4b6f5d1) swap both to their imbl persistent counterparts. Results at N=15k: - item kind: 1436 -> 242 us/call (-83%) - storage_collection_metadata kind: 567 -> 5.12 us/call (-99%) - apply_updates_inner slope (10->15k): +1.94 -> +0.18 ms/DDL - create_p50 (15k): 48.93 (phase 7) -> 42.84 (phase 12), -6.09 ms NOTES.md captures the per-phase tables and writes up the generalizable pattern: inline value types stored inside an imbl::OrdMap (or behind shared Arc) silently lose the O(1) clone property to any non-persistent sub-collection field.

…ue pattern Read-only audit of every imbl::OrdMap in the catalog/controller hot paths, looking for the same "outer is persistent, inner is not" shape that drove phases 11 and 12. Findings, ranked: HIGH (same shape, same hot path, drop-in fix): - Database.{schemas_by_id, schemas_by_name}: BTreeMap inside Cluster Cluster -> wrong, Database value held in imbl::OrdMap<DatabaseId, Database>. get_schema_mut is on the every-apply_item_update path. Worth landing next. MEDIUM (workload-dependent, not exercised by the audit_pad bench): - Cluster.bound_objects: BTreeSet<CatalogItemId> grows with every object bound to that cluster. Invisible in this bench (plain tables don't bind to a cluster), but expected to slope under MV / index workloads on a single cluster. - Cluster.replica_id_by_name_ / replicas_by_id_ / log_indexes: small in practice. LOW (small or workload-specific): - Role.{vars.map, membership.map}: per-role counts are small. - SourceReferences.references: grows with one source's refs, not N. - CatalogEntry.{referenced_by, used_by}: small per entry, but the 16-entry imbl leaf clone of entry_by_id also re-clones CatalogItem (incl. optimized/physical plans for MV/Index/CT) for sibling entries; matters under MV scale, not table scale. - notices_by_dep_id: value is Vec<Arc<_>>, shallow clone is cheap. Recommended next step: land the Database BTreeMap -> imbl::OrdMap swap, then design a real MV/index scale bench before considering the MEDIUM tier. The existing mz_catalog_apply_update_kind_seconds{kind} histogram is the canonical signal for whether each tier is worth fixing.

…istent collections Phase 11+12 fixed the two non-persistent inner collections that the audit_pad bench exposed (Schema.items and StorageMetadata.collection_metadata). This commit applies the same fix to the remaining sites identified by the read-only sweep in 351ddb9: Database.{schemas_by_id, schemas_by_name}: BTreeMap -> imbl::OrdMap Cluster.{bound_objects, replica_id_by_name_, replicas_by_id_}: BTreeMap/BTreeSet -> imbl::OrdMap/imbl::OrdSet RoleMembership.map: BTreeMap -> imbl::OrdMap RoleVars.map: BTreeMap -> imbl::OrdMap SourceReferences.references: Vec -> imbl::Vector All of these live inside imbl::OrdMap<K, V> on CatalogState (or transitively inside such a value type), so the preliminary_state/state Cow split in Catalog::transact_inner forces imbl leaf path-copies to deep-clone them. The audit_pad bench doesn't exercise these paths (plain tables don't touch clusters / roles / sources), but the same shape that produced the +5 ms/DDL slopes in phases 11+12 would surface on cluster-heavy / role-heavy workloads. Two intentional non-changes: * Cluster.log_indexes stays BTreeMap<LogVariant, GlobalId>: tight API contract with the compute controller (arranged_logs: BTreeMap<...>) and bounded by ~10 log variants. * CatalogEntry.{referenced_by, used_by} stay Vec<CatalogItemId>: the mz_sql::catalog::CatalogItem trait returns &[CatalogItemId] for them, which imbl::Vector can't satisfy (no Deref<[T]>). Per-entry counts are small and the dominant per-entry clone cost during entry_by_id leaf path-copy is item: CatalogItem, not these vectors. Trait signatures in mz_sql::catalog::{CatalogDatabase, CatalogRole, CatalogCluster} change to match (&imbl::OrdMap / &imbl::OrdSet). The per-field "why imbl" comments on Schema and StorageMetadata are removed and consolidated into a single rule-block doc comment on CatalogState explaining the pattern, the two slopes that motivated it, and the rule for new fields. Two intentional holdouts are called out there as well.

…on tables The phase 13 sweep (1a8446c) switched five more inline-in-imbl::OrdMap collections off of BTreeMap/BTreeSet/Vec, but the audit_pad bench (plain CREATE/DROP TABLE) doesn't exercise the cluster, role, or source paths — so the expected outcome was "no measurable change, no regression." That's what we got. item @ N=15k: 242 -> 238 us/call. storage_collection_metadata unchanged at ~5 us/call. apply_updates_inner total at N=15k flat at 1.48 ms/DDL. Run-to-run noise dominates. The sweep is landmine-prevention for cluster-heavy / role-heavy / source-heavy workloads where the same leaf-clone pattern would surface a slope on those DDL kinds. A future scale bench targeting CREATE INDEX / MV / GRANT is the right way to actually measure those payoffs; the existing per-kind histogram is the canonical signal.

Adds mz_storage_collections_prepare_state_phase_seconds{phase} so we can attribute the prepare_state slope (currently 2.4 ms/call at N=15k user tables, growing roughly linearly) to its sub-phases. Sub-phases: - insert_add, insert_register, delete: txn mutations. - dropped_shard_lookup: self.collections.lock() acquire + the loop over dropped_mappings. - insert_unfinalized: txn.insert_unfinalized_shards. - mark_finalized: self.finalized_shards.lock() + txn.mark_shards_as_finalized. Companion to mz_catalog_transact_phase_seconds (outer) and mz_apply_catalog_implications_phase_seconds. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

When the txns shard upper advances, BackgroundTask::run propagates the new upper to every txns-backed user table by calling self.update_write_frontiers(...) with one entry per table. The previous implementation held self.collections (the single global mutex shared with every DDL path, including catalog prepare_state) for the full O(N) walk. At N=15k user tables that meant one txns-upper tick = ~10+ ms of held mutex, plus the BackgroundTask staying CPU-bound while it held it. Phase-level bench (mz_storage_collections_prepare_state_phase_seconds) attributed the slope to exactly this: dropped_shard_lookup (which only does self.collections.lock() + an empty for-loop in our CREATE TABLE workload) went from 1.6 µs/call at N=5k to 599 µs/call at N=10k, matching the OUTER prepare_state mean (24 -> 625 µs). Process updates in chunks of 256, releasing the lock and yielding the task between chunks. The fix mostly helps other phases that compete with the BackgroundTask for CPU and lock access: at N=10k the end-to-end create_p50 drops by 11 ms (63.18 -> 52.04), driven by: - coord_builtin_table_execute -3.58 ms/DDL - coord_pre_transact -1.87 ms/DDL - tx_commit -1.56 ms/DDL prepare_state itself gets slightly worse (the per-chunk lock dance costs a few µs and the BackgroundTask gets to re-win the lock more often), but the BackgroundTask no longer monopolizes either lock or CPU for the whole walk, and the DDL pipeline catches up faster. A future architectural fix would store the shared txns upper in one place on StorageCollectionsImpl and skip the O(N) propagation entirely; that's larger and touches every reader of write_frontier on a txns-backed collection. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

NOTES writeup for phase 14: - mz_storage_collections_prepare_state_phase_seconds attributed the prepare_state slope to dropped_shard_lookup (self.collections.lock acquire), confirming BackgroundTask::update_write_frontiers as the contention source. - The chunked unlock in update_write_frontiers ended up being a CPU yield win, not the lock-wait win the source comment suggested: prepare_state's own contention got slightly worse (lock barging by the BackgroundTask), but other phases that compete with it for CPU caught up much faster. - Net same-run effect at N=10k: create_p50 63.18 -> 52.04 (-11 ms), coord_inner_total 37.59 -> 29.52 (-8 ms). - prepare_state slope is not flat; a future architectural pass needs to store the shared txns upper in one field rather than fanning it out to every txns-backed collection's write_frontier on every tick. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The BackgroundTask::run txns-upper branch used to walk every txns-backed user collection on each tick of the txns shard upper and call update_write_frontiers with N entries. The work was O(N) under the global `collections` mutex and was the slope owner for prepare_state on the DDL hot path: every txns advance blocked every DDL caller behind it. Phase 14 chunked the lock-hold inside update_write_frontiers but kept the O(N) work on the tick path; this iteration removes it. The latest observed txns upper now lives in a single shared field StorageCollectionsImpl::txns_upper. The txns-upper branch just publishes to that field (O(1)) and reissues the upper future. Readers that need the current write frontier of a txns-backed collection (collections_frontiers, set_read_policies_inner, alter_table_desc) go through effective_write_frontier(), which consults the shared field for txns-backed collections; the per-collection write_frontier remains source of truth for non-txns collections. Persist still needs the per-collection implied capability (since) to advance so it can compact, so BackgroundTask::run now runs a 1Hz periodic sweep that walks txns_shards and calls update_write_frontiers exactly as the old per-tick path did. The work per sweep is unchanged from before; the change is the frequency (1Hz vs the ~40Hz observed under DDL load), and the resulting drop in how often the lock is contended with prepare_state.

Adds two Prometheus counters to the BackgroundTask so we can measure how often the txns shard upper actually advances vs how often the periodic since-downgrade sweep runs. - mz_storage_collections_txns_upper_advances_total: incremented on every observed upper advance (the old O(N) per-tick branch, now O(1) after phase 15). - mz_storage_collections_txns_since_sweeps_total: incremented on every periodic sweep that fans the shared upper out to per- collection implied capabilities. The ratio between these two is the multiplier we coalesce on the DDL hot path: under phase-15 it should be roughly (advances/s) : 1 Hz. Lets the next investigation iteration distinguish "txns shard genuinely commits at rate X" from "BackgroundTask was spinning for some other reason".

Adds the iteration-2 writeup answering the phase-15 follow-up: why the BackgroundTask's txns-upper branch was firing at ~40 Hz under DDL load. Result, measured with the counter added in cdb8484: - At idle, the txns shard upper advances at 1.0 Hz. This is the Coordinator's advance_timelines_interval (default_timestamp_interval = 1000 ms) firing a periodic group commit that lands one append against the txns shard. Not a bug. - Under bench DDL load (100 CREATE+DROP back-to-back), the upper advances at 48 Hz = 4.02 advances per CREATE+DROP pair. The four commits per pair are: register, audit-append (CREATE), forget, audit-append (DROP). They are structural to the catalog/storage-controller split — audit goes through the coord group-commit path, register/forget go through the TxnsTableWorker, and they don't share a transaction. No fix shipped. Phase 15's consumer-side coalescing already removed the per-advance O(N) work. Killing the 4x producer-side multiplier would require batching register/forget with the audit append into a single txns commit across the adapter ↔ storage-controller boundary; that's a larger refactor than the remaining slope justifies (already +0.42 ms/+1k tables, 10k→25k). The writeup also flags the boundary observation: both the consumer-side fanout (fixed) and the producer-side multiplier (unfixed) are different symptoms of the same shape — layers above txn-wal treat its commit API as unmetered RPC. A future iteration could explore handing a single Txn through the catalog transact path if/when CRDB consensus pressure becomes the bottleneck.

Phase 15 reported a residual slope of +0.42 ms / +1k tables across N=10k → 25k from a 100-rep sweep and called it "flat enough". The 500-rep bench with 30 warmup reps discarded shows the slope is actually +1.296 ms / +1k tables [95% CI +1.187, +1.434] for create_p50 and +1.108 [+1.028, +1.194] for drop_p50 — real, 3.1× steeper than phase 15 reported, and super-linear (accelerates from +0.31 ms/+1k between N=10k → 15k to +2.31 ms/+1k between N=20k → 25k). Attribution (per-call mean diffed over the measurement window) puts the slope owner below the catalog-transact layer: persist's consensus_cas on user_data shards grows from 24 ms to 119 ms per call, with the same pattern in txns-shard consensus_scan and catalog-shard consensus_truncate. Two CAS-bound calls hit the DDL hot path: open_data_handles_concurrent (16 → 24 ms) during CREATE TABLE, and tx_commit (2.8 → 5.6 ms) for the catalog commit. This is structural to the persist-on-CRDB layer, not a one-file fix at the adapter / coordinator / storage-collections boundary. We deliberately do not ship a fix this iteration. If a future iteration wants to reduce per-DDL persist cost, the highest-leverage change is killing the 4× producer-side multiplier called out in iteration 2 of phase 16 — batch register / forget / audit into a single txns commit so each DDL pays one CRDB consensus round trip instead of four. Bench harness, analyzer, and attribution scripts live in /home/ubuntu/envd-ddl-investigation/ (bench_phase17.py, analyze_phase17.py, attribute_phase17.py). Results in results_phase17/ and metrics_phase17/. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Rerunning the phase-17 N=10k → N=25k bench against `mz-bogo-consensus` (in-memory consensus backend) shows that ~53% of the +1.296 ms/+1k create-slope reproduces with bogo (+0.691 ms/+1k [95% CI +0.605, +0.738]) despite per-CAS latency staying flat (0.93 → 0.96 ms on `consensus_cas`/`user_data`, growth 1.03×). Slope splits roughly half / half: ~0.69 ms/+1k is coordinator-side work that scales with catalog size, dominated by `transact_inner` (+0.258 ms/+1k) and `prepare_state` (+0.200 ms/+1k); the remaining ~0.6 ms/+1k is CRDB-specific per-CAS latency growth on the `consensus` table. Phase 17's "structural to persist-on-CRDB" is half right; the other half is fixable in coordinator code. Suggested iteration 5: finer breakdown inside `transact_inner` to localize an O(N) walk, and an additional probe inside `prepare_state` outside the already-attributed sub-phases.

The "CRDB-side" half of the phase-18 split is not in CRDB. It is the persist-client CRDB connection pool (default max_size = 50) saturating under N-shard background CAS load. EXPLAIN ANALYZE on the actual CAS query against a populated consensus table shows ~2 ms execution, 2/2 MVCC seek/step, flat with shard size (2 rows vs 13 689 rows) and table size (224 939 rows at N=10k). Per-call connpool acquire mean over the phase-17 measurement window: 14 ms @ N=10k -> 97 ms @ N=25k, 91 % of consensus_cas latency at the high end. A pool=500 rebuild at N=10k drops user_data CAS from 16.4 ms to 2.0 ms per call. DDL-level create_p50 unchanged at N=10k because the on-path catalog/txns CAS calls aren't pool-bound at this scale. No fix shipped — bumping the pool default needs a slope re-run plus an init-order audit so the dyncfg can take effect without an envd restart.

Phase 19 measured a single point (N=10k, steady state, user_data CAS mean 16 -> 2 ms at pool=500) and concluded the connection-pool default bump would flatten the +0.61 ms/+1k 'CRDB-side' half of the create-slope. Phase 20 re-runs the full phase-17 4-scale sweep (N=10k -> 25k, 30 warmup + 500 measure reps, ascending pad in a single optimized envd run) with the cfg.rs default bumped to 500 and a fresh rebuild. Headline numbers: create_p50 slope: phase17 +1.296 ms/+1k [95 % CI +1.187, +1.434] phase20 +1.487 ms/+1k [95 % CI +1.391, +1.624] (CIs do not overlap) drop_p50 slope: phase17 +1.108 ms/+1k [95 % CI +1.028, +1.194] phase20 +1.185 ms/+1k [95 % CI +1.117, +1.257] The pool *did* bind to size=500 at every snapshot, and per-call connpool_acquire mean dropped from 14.2 -> 2.4 ms at N=10k and from 96.7 -> 65.2 ms at N=25k, matching phase 19's microbench direction. But: - N=25k create_p50 is unchanged (70.39 -> 69.85, within CI). - The non-pool portion of user_data CAS grew from 16.0 -> 67.9 ms at N=25k. Lifting the pool throttle just shifted the queue to whatever sits behind the pool (CRDB connection slots, SQL CPU, or tokio_postgres scheduling). - Total user_data CAS at N=25k went from 112.7 -> 133.1 ms, an 18 % regression at the high end. Phase 19's 8x CAS speedup is real but measured on idle steady-state at N=10k; it does not generalize to the DDL-burst regime with ~1.6 k concurrent CAS/sec. Decision: defer the default bump. The level win at N=10k–N=20k is 3–10 ms, but N=25k is unchanged and the slope is statistically worse. cfg.rs default reverted to 50; no code change shipped. Across iterations 14–20 the DDL slope picture is now: phase 15 erased a +0.6 ms/+1k contributor; phases 18 (transact_inner + prepare_state, +0.4 ms/+1k combined, bogo-visible) and 20 (pool plus behind-pool, entangled) are the remaining medium-sized contributors. No single big-bang fix remains; marginal return is decreasing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

aljoscha and others added 30 commits May 13, 2026 11:30

fix crdb cluster spec sheet

11564b8

aljoscha and others added 29 commits May 18, 2026 21:29

envd-ddl-scalability: plan phase 15+ — extend to N=20k/25k, iterate

5e405b4

envd-ddl-scalability: phase 15 writeup + 40Hz follow-up note

542a946

aljoscha force-pushed the envd-ddl-scalability branch from b8d836b to d0a4f4b Compare May 21, 2026 12:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DNM: envd scalability work#36634

DNM: envd scalability work#36634
aljoscha wants to merge 88 commits into
MaterializeInc:mainfrom
aljoscha:envd-ddl-scalability

aljoscha commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aljoscha commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant