Refactor: standard install/start/check/stop/load/query interface per system #860

Open
alexey-milovidov wants to merge 7 commits into main from refactor/per-system-script-interface

Conversation

@alexey-milovidov
Member

Summary

  • Split each local system's monolithic benchmark.sh into 7 single-purpose scripts (install, start, check, stop, load, query, data-size) with a stable contract, driven by a new shared lib/benchmark-common.sh.
  • Wrap dataframe / in-process systems (pandas, polars-dataframe, chdb-dataframe, daft-parquet*, duckdb-dataframe, sirius) in small FastAPI servers so they fit the same start/stop/query lifecycle.
  • 88 local systems refactored; cloud/managed systems and a handful of non-functional ones are intentionally untouched.

Why

Previously, every system's benchmark.sh bundled installation, server lifecycle, dataset download, data loading, and query dispatch into one script — and run.sh hard-coded the per-query orchestration. There was no programmatic per-query entry point, so:

  1. Tweaking the dataset, query set, or per-query behavior (e.g. restarting the system between queries to neutralize warm-process effects) required editing every system's scripts individually.
  2. Building an online "run query X against system Y" service was impossible.
  3. Most run.sh scripts ran all 3 tries inside a single CLI invocation, so OS-cache warmth from try 1 leaked into tries 2 and 3.

The new per-system interface

| Script | Stdin | Stdout | Stderr | Notes |
|---|---|---|---|---|
| install | - | progress | progress | Idempotent. Env prep + system install. |
| start | - | - | progress | Starts the daemon. Idempotent. Empty/exit-0 for stateless tools. |
| check | - | - | progress | Trivial query (e.g. SELECT 1). Exit 0 iff responsive. |
| stop | - | - | progress | Stops the daemon. Idempotent. |
| load | - | progress | progress | Runs create.sql and loads the data; deletes source files, then sync. |
| query | one query | query result, any format | last line: fractional seconds (0.123) | Non-zero exit on failure. |
| data-size | - | bytes (one integer) | - | Reports the data footprint. |
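As a sketch, a `query` script satisfying this contract could take the following shape. The stand-in "engine" here just echoes the query back; a real script would pipe the query to the system's client (e.g. clickhouse-client). The function name is illustrative, not from the PR.

```shell
# Hypothetical sketch of the ./query contract: one query on stdin, result on
# stdout, fractional seconds as the LAST line of stderr.
run_query() {
    local query start end
    query=$(cat)                                  # exactly one query on stdin
    start=$(date +%s.%N)
    printf '%s\n' "$query"                        # stand-in result on stdout
    end=$(date +%s.%N)
    # fractional seconds (e.g. 0.123) as the LAST line of stderr
    awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }' >&2
}
```

Keeping the timing on stderr means stdout stays a clean result stream, so the driver can discard it while still parsing the runtime.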

Each system's benchmark.sh becomes a 4-line shim that sets a couple of env vars and exec's the shared driver:

```bash
#!/bin/bash
export BENCH_DOWNLOAD_SCRIPT="download-hits-parquet-partitioned"
export BENCH_RESTARTABLE=yes
exec ../lib/benchmark-common.sh
```

The shared driver runs install → start+check → download → load (timed) → for each query: flush caches; if BENCH_RESTARTABLE=yes, stop+start; run query 3× → data-size → stop. The output log shape (Load time:, [t1,t2,t3] per query, Data size:) is identical to the old benchmark.sh, so cloud-init.sh.in's POST to play.clickhouse.com keeps working unchanged.
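The per-query portion of that flow can be sketched roughly as follows (hypothetical shape; the real lib/benchmark-common.sh differs in detail, and the drop_caches line needs root, so it is shown commented out):

```shell
# Sketch of the per-query driver loop. Assumes ./stop, ./start, ./check and
# ./query follow the contract described above.
run_queries() {
    local queries_file=$1 query t times
    while IFS= read -r query; do
        sync                                  # flush dirty pages to disk
        # echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop the OS page cache
        if [ "$BENCH_RESTARTABLE" = "yes" ]; then
            ./stop; ./start
            until ./check; do sleep 1; done   # readiness check is authoritative
        fi
        times=()
        for _ in 1 2 3; do
            # the runtime is the LAST line of ./query's stderr
            t=$(printf '%s' "$query" | ./query 2>&1 >/dev/null | tail -n 1)
            times+=("$t")
        done
        echo "[${times[0]},${times[1]},${times[2]}]"
    done < "$queries_file"
}
```

Note the redirection order in `./query 2>&1 >/dev/null`: stderr is duplicated into the pipe first, then stdout is discarded, so only the timing line reaches `tail`.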

BENCH_RESTARTABLE=no is used for embedded CLIs (duckdb, sqlite, datafusion, …) and dataframe wrappers — restarting a single CLI/Python process between queries would dominate query time. For these, OS caches are still flushed between queries.

Scope

Refactored (88 systems):

  • Server, restartable: clickhouse, postgresql, mysql, mariadb, monetdb, druid, pinot, vertica, exasol, kinetica, heavyai, questdb, cockroachdb, elasticsearch, ydb, … and the postgres/clickhouse/mysql variants (timescaledb, citus, paradedb, postgresql-indexed, clickhouse-parquet*, clickhouse-datalake*, mysql-myisam, tidb, infobright, …)
  • Embedded CLI, not restartable: duckdb (and variants), sqlite, datafusion (and partitioned), glaredb (and partitioned), hyper, hyper-parquet, octosql, opteryx, sail (and partitioned), drill, turso, chdb, chdb-parquet-partitioned
  • Dataframe with FastAPI wrapper, not restartable: pandas, polars-dataframe, chdb-dataframe, daft-parquet, daft-parquet-partitioned, duckdb-dataframe, sirius
  • Spark family: spark, spark-auron, spark-comet, spark-gluten
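For the FastAPI-wrapped dataframe systems, `check` can reduce to a trivial HTTP probe against the wrapper. A minimal sketch (the port, path, and function name are assumptions, not taken from the PR):

```shell
# Hypothetical ./check for a FastAPI-wrapped system: exit 0 iff the wrapper
# answers a trivial HTTP request.
check_wrapper() {
    local port=${1:-8000}
    curl -sf --max-time 5 "http://127.0.0.1:${port}/" >/dev/null
}
```

`curl -f` turns HTTP error statuses into a non-zero exit, which is exactly the "exit 0 iff responsive" semantics the contract asks for.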

Not refactored (intentionally out of scope):

  • Cloud / managed: alloydb, athena, aurora-{mysql,postgresql}, bigquery, clickhouse-cloud, databricks, motherduck, redshift, redshift-serverless, snowflake, hydrolix, firebolt, hologres, tinybird, hydra, mariadb-columnstore, pg_duckdb, singlestore, supabase, tablespace, tembo-olap, timescale-cloud, crunchy-bridge-for-analytics, s3select, …
  • Non-functional: csvq, dsq, locustdb (panic on first query); exasol, spark-velox (empty dirs)
  • Non-SQL or no SQL CLI: mongodb (JS aggregation pipelines), polars (no SQL CLI; the dataframe variant is wrapped instead)

Validated end-to-end on a 96-core / 185 GB ARM machine

| System | Data | Outcome |
|---|---|---|
| clickhouse | 14.2 GB / 100M rows | Full 43 queries × 3 tries with stop/start between queries; load 124s |
| duckdb | 20.6 GB / 100M rows | Full 43 queries × 3 tries (no restart); load 69s |
| pandas | 4.2 GB in-mem (5M-row subset) | 42/43 queries; Q43 hit a pandas lambda bug → recorded as null (framework's error path works) |
| sqlite | 3.9 GB (5M-row subset) | First 5 queries × 3 tries; load 68s |
| postgresql | 100M rows / 75 GB TSV | First 3 queries × 3 tries with restart; load 829s. Cold-cache spike clearly visible (135s → 7s after warmup) — confirms per-query restart actually flushes the page cache |

All 88 refactored systems pass bash -n and have executable bits set on the 7 scripts + benchmark.sh.
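That lint pass can be scripted roughly like this (a hypothetical helper, not part of the PR): every system directory must contain the 7 scripts plus benchmark.sh, all executable and syntactically valid under `bash -n`.

```shell
# Sketch of the executable-bit + bash -n check over all system directories.
check_scripts() {
    local root=$1 sys s failed=0
    for sys in "$root"/*/; do
        for s in install start check stop load query data-size benchmark.sh; do
            if [ ! -x "$sys$s" ] || ! bash -n "$sys$s"; then
                echo "FAIL: $sys$s" >&2
                failed=1
            fi
        done
    done
    return "$failed"
}
```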

Bug fixes surfaced during validation

  • lib/benchmark-common.sh: data-size now runs before stop (clickhouse and pandas need the server up to report size).
  • clickhouse/start: idempotent (was erroring when already running).
  • duckdb/load, sqlite/load: rm -f hits.db/mydb for idempotent reruns.
  • postgresql/load: -v ON_ERROR_STOP=1 so COPY data errors actually fail the script instead of silently rolling back.
  • BENCH_DOWNLOAD_SCRIPT may now be empty for systems that read directly from S3 datalakes / remote services (clickhouse-datalake*, duckdb-datalake*, chyt, …).

Flagged for follow-up review

  • duckdb-memory — :memory: semantics force a per-query reload; will inflate timings vs. the original single-process flow.
  • cloudberry, greenplum — multi-phase install (reboot between phases); the shim only runs phase 1.
  • sirius — GPU-dependent; long-lived duckdb CLI subprocess proxy; review the stdin/sentinel protocol.
  • paradedb*, pg_ducklake, pg_mooncake — Docker container created in install then docker cp in load (small divergence from the original docker run -v ... due to the lifecycle order: start runs before download).

Test plan

  • bash -n on all 88 systems' scripts
  • clickhouse: full 43-query benchmark.sh on 100M-row real data
  • duckdb: full 43-query benchmark.sh on 100M-row real data
  • pandas: 43-query benchmark.sh on a 5M-row subset
  • sqlite: abbreviated benchmark.sh on a 5M-row subset
  • postgresql: abbreviated benchmark.sh on full 100M-row data
  • Smoke-run on a fresh c6a.metal (or equivalent) VM via cloud-init for a representative system from each family before merging
  • Verify play.clickhouse.com log-ingestion sink continues to parse the output for at least one production benchmark run

🤖 Generated with Claude Code

alexey-milovidov and others added 3 commits May 7, 2026 12:14
…/data-size

Each local system now exposes a small set of single-purpose scripts with a
stable contract, so they can be driven by a shared lib/benchmark-common.sh
and reused by external tooling (e.g. an online "run query against system X"
service):

  install     env prep + system install (idempotent)
  start       start daemon (idempotent; empty for stateless tools)
  check       trivial query, exit 0 iff responsive
  stop        stop daemon (idempotent)
  load        runs create.sql + loads data, deletes source files, sync
  query       SQL on stdin; result on stdout; runtime in fractional seconds
              on the last line of stderr; non-zero exit on error
  data-size   prints data footprint in bytes (one integer to stdout)

Each system's old monolithic benchmark.sh is replaced by a 4-line shim that
sets a couple of env vars (BENCH_DOWNLOAD_SCRIPT, BENCH_RESTARTABLE) and
exec's lib/benchmark-common.sh. The shared driver runs the unified flow:
install -> start+check -> download -> load (timed) -> for each query
{flush caches; optionally stop+start to neutralize warm-process effects;
run query 3x} -> data-size -> stop. Output format ([t1,t2,t3], Load time,
Data size) matches the previous benchmark.sh exactly so cloud-init.sh.in's
log POST to play.clickhouse.com keeps working unchanged.

For dataframe/in-process systems (pandas, polars-dataframe, chdb-dataframe,
daft-parquet*, duckdb-dataframe, sirius), the engine is wrapped in a small
FastAPI server (server.py) so the start/stop/query interface still applies.
BENCH_RESTARTABLE=no for these (and for embedded CLIs like duckdb, sqlite,
datafusion, etc.) since restarting a single Python/CLI process between
queries would dominate query time.

Scope: 88 local systems refactored. Cloud/managed systems and a handful of
non-functional ones (csvq, dsq, locustdb, mongodb, polars CLI, exasol,
spark-velox) are intentionally left untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves conflict in clickhouse-datalake{,-partitioned}: upstream switched
the datalake variants from filesystem-cache to userspace page-cache (PR #818).
The refactored install/query scripts now adopt the page-cache approach.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mongodb: query takes a MongoDB aggregation pipeline (Extended JSON, one
line) on stdin instead of SQL — these are the same canonical 43 ClickBench
queries, just expressed as mongo pipelines. queries.txt is generated from
queries.js (the source of truth) by replacing JS-only constructors
(NumberLong, ISODate, NumberDecimal) with their EJSON canonical form. The
shim sets BENCH_QUERIES_FILE=queries.txt to point the driver at it.
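A minimal sketch of that constructor rewrite (the sed patterns are illustrative; the real queries.js → queries.txt generator may handle more cases and argument forms):

```shell
# Illustrative rewrite of JS-only constructors into their EJSON canonical form.
js_to_ejson() {
    sed -E \
        -e 's/NumberLong\(([0-9]+)\)/{"$numberLong": "\1"}/g' \
        -e 's/NumberDecimal\("([^"]+)"\)/{"$numberDecimal": "\1"}/g' \
        -e 's/ISODate\("([^"]+)"\)/{"$date": "\1"}/g'
}
```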

polars: wrapped in a FastAPI server analogous to polars-dataframe, but the
load step uses pl.scan_parquet (LazyFrame) so the parquet file remains
needed at query time — the load script does NOT delete hits.parquet.
data-size returns the on-disk parquet size since a LazyFrame has no
materialized in-memory size.

Both systems now expose the standard install/start/check/stop/load/query/
data-size scripts and a 4-line benchmark.sh shim, removing the old
benchmark.sh / run.js / query.py / formatResult.js paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…use in query

Per review: clickhouse-local persists table metadata in its --path dir, so
the CREATE TABLE only needs to run once during ./load. ./query just runs
the query against the persisted table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alexey-milovidov and others added 3 commits May 7, 2026 12:29
…atively

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… readiness

Per review (alexey-milovidov): clickhouse start leaves the system in the
desired state (server running) even when it returns non-zero with "already
running". Make the shared driver tolerate non-zero from ./start and rely on
bench_check_loop as the authoritative readiness signal. This lets per-system
start scripts stay simple — they just need to make a best-effort attempt to
launch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prmoore77 added a commit to gizmodata/ClickBench that referenced this pull request May 7, 2026
…ouse#860)

Adopts the per-system 7-script interface from ClickHouse#860 for gizmosql/, and
replaces the Java sqlline-based gizmosqlline client with the C++
gizmosql_client shell that ships with gizmosql_server.

Scripts (matching the contract from lib/benchmark-common.sh):
  benchmark.sh - 4-line shim that exec's ../lib/benchmark-common.sh
  install      - apt + curl gizmosql_cli_linux_$ARCH.zip; no openjdk, no
                 separate gizmosqlline download
  start        - idempotent server bring-up (skips if port 31337 is open)
  check        - cheap TCP probe (auth-gated SQL would need credentials)
  stop         - kills tracked PID; pkill belt-and-braces fallback
  load         - rm -f clickbench.db, then create.sql + load.sql via
                 gizmosql_client; deletes hits.parquet and sync's
  query        - reads one query from stdin, runs via gizmosql_client with
                 .timer on + .mode trash; emits fractional seconds as the
                 last stderr line (parsed from "Run Time: X.XXs")
  data-size    - wc -c clickbench.db

Notes:
- BENCH_DOWNLOAD_SCRIPT=download-hits-parquet-single, BENCH_RESTARTABLE=yes
  (gizmosql is a server, so per-query restart neutralizes warm-process
  effects, matching the clickhouse/postgres pattern in ClickHouse#860).
- util.sh now exports GIZMOSQL_HOST/PORT/USER/PASSWORD - the env vars
  gizmosql_client reads natively, so query/load can call gizmosql_client
  with no flags. The server still receives the username via --username.
- PID_FILE moved to a stable /tmp path (was /tmp/gizmosql_server_$$.pid,
  which broke across the start/stop process boundary in the new layout).

This PR depends on ClickHouse#860 (which introduces lib/benchmark-common.sh and the
contract). Once ClickHouse#860 lands, this PR's diff against main will be only
the gizmosql/ files. Validated locally on macOS with gizmosql v1.22.4:
the query script produces the expected fractional-seconds last line with
correct stdout/stderr separation, and exits non-zero on error paths.

See https://docs.gizmosql.com/#/client for gizmosql_client docs.