Skip to content

Spark-only BerlinMOD benchmark harness consuming the canonical suite#23

Open
estebanzimanyi wants to merge 31 commits into
MobilityDB:mainfrom
estebanzimanyi:feat/berlinmod-benchmark
Open

Spark-only BerlinMOD benchmark harness consuming the canonical suite#23
estebanzimanyi wants to merge 31 commits into
MobilityDB:mainfrom
estebanzimanyi:feat/berlinmod-benchmark

Conversation

@estebanzimanyi

Copy link
Copy Markdown
Member

Stacked on #22 (the canonical UDF library). Review #22 first — this PR's own change is the bench commit on top; its diff cleans up to bench-only once #22 merges to main.

Adds the Q1–Q17 query set with reference expecteds, the loader/runner scripts, the BerlinMODBench/BerlinMODDemo drivers, and the 3-tier index framework (Spark column-store prefilter · th3index cells · PG GiST/SP-GiST) with the NxN cross-join mitigations.

Demo/harness/data only — no change to the library surface, so the unit suite is unchanged (907/907).

@estebanzimanyi estebanzimanyi force-pushed the feat/berlinmod-benchmark branch from b0bc798 to 4d7a5e8 Compare June 12, 2026 10:42
@estebanzimanyi estebanzimanyi force-pushed the feat/berlinmod-benchmark branch 3 times, most recently from 09962c2 to 3c2223c Compare June 12, 2026 20:26
estebanzimanyi added a commit to estebanzimanyi/MobilitySpark that referenced this pull request Jun 12, 2026
Folds the MobilityDB#23 canonical-suite update into the integration/benchmark evidence so
evidence == the deliverable stack (canonical q*.sql, eContains/round, no
Spark-local SQL variants or preprocessing).
tools/codegen_spark_udfs.py emits MobilitySpark UDF-registration classes from the
MEOS-API catalog (output/meos-idl.json), resolving each SQL name to its MEOS-C
backing via the @sqlfn / @sqlop map (MEOS-API MobilityDB#18). Two modes:
- SINGLE: one backing -> a 1:1 UDF (type-marshalling: each MEOS C type <-> its
  parse-from-String / serialize-to-String form).
- DISPATCH: an overloaded SQL name / operator (overlaps via &&, stbox(geom,time),
  timeSpan) -> ONE UDF that classifies each arg by its MEOS type and routes to the
  catalog-determined backing. Classification is MEOS-driven and wire-format-safe:
  spans/stboxes/geometries travel as TEXT, only temporals as hex, so the leading
  token disambiguates ('['/'(' span, STBOX stbox, hex temporal, else geometry) and
  temporal_from_hexwkb is never fed a non-temporal. Emitted lambdas call only static
  GeneratedFunctions (no captured state -> Spark-serializable). Zero hand heuristics,
  zero new MEOS functions.
@estebanzimanyi estebanzimanyi force-pushed the feat/berlinmod-benchmark branch 4 times, most recently from ecdc2ae to 9109d58 Compare June 13, 2026 03:41
@estebanzimanyi estebanzimanyi changed the title BerlinMOD benchmark harness and 3-tier index framework Spark-only BerlinMOD benchmark harness consuming the canonical suite Jun 13, 2026
…roup

Generalize the generator over the whole JMEOS public surface (was a 4-UDF POC):
mirror JMEOS FunctionsGenerator's marshalling conventions — temporals / spans /
sets / boxes / jsonb as hex-WKB or type text, TimestampTz as OffsetDateTime,
DateADT as int, and bool f(.., result) out-params dropped with their value
returned. Cross-check every emission against the JMEOS jar signatures (arity +
return kind) so a collapsed catalog type can never miscompile. Organize the
emitted UDFs into one class per doxygen @InGroup module — the reference-manual
structure, so a function is found in the same place across tools — excluding
meos_internal_*, and splitting oversized groups to stay under the JVM class
limits. Emits ~2200 1:1 UDFs, compiling green.
Add a dispatch pass: each portable comparison bare name (everEq/everNe/everLt/
everLe/everGt/everGe, alwaysEq.., tempEq.. — RFC #920 / contract families
everComparison/alwaysComparison/temporalComparison) is emitted ONCE wrapping its
MEOS superclass entrypoint (ever_<op>_temporal_temporal / always_<op>_temporal_
temporal / temporal_<op>), which dispatches every concrete temporal type
internally from the type-erased hex-WKB string — so Spark needs no per-type
overload and no Java type-inspection. 18 bare names, emitted into
GeneratedUdfs_portable_comparison, compiling green.
Three fixes so the catalog-generated UDF surface compiles against the
pin-12l (4408) JMEOS jar:
- Restore the runtime substrate MeosMemory/MeosThread/MeosNative (memory +
  native-init helpers the generated UDFs depend on). These are infrastructure,
  not hand-written UDF surface, and were dropped with the hand layers.
- Drop the legacy org.mobiltydb typo-package placeholders (Main/PowerUDF/UDT)
  that imported the long-gone jmeos.functions package.
- Fix the generator (tools/codegen_spark_udfs.py) to CLEAN its output dir before
  emitting: a prior run's classes (a function later excluded by the jar
  arity/kind cross-check, or a now-empty/renamed group) otherwise linger and
  silently break the build.
The whole Spark UDF surface is now GENERATED from the catalog at build time
(North Star: bindings generated from MEOS, none hand-written or committed):
- exec-maven-plugin runs tools/codegen_spark_udfs.py at generate-sources,
  reading the vendored 4408 catalog (tools/meos-idl.json) + the org.jmeos:meos
  jar's actual symbols, emitting to target/generated-sources/spark.
- build-helper-maven-plugin adds that as a source root.
- target/ is gitignored; no generated .java is committed.
'mvn clean compile' produces 195 group classes (2216 1:1 UDFs) and BUILDs green.
NOTE: the org.jmeos:meos:1.0 jar must be in the local repo (mvn install:install-file
from the JMEOS build); CI installs it before the Spark build.
GeneratedSurfaceTest registers the catalog-generated UDFs
(GeneratedSpatioTemporalUDFs.registerAll) in a real SparkSession and asserts
results across families via spark.sql against libmeos: temporal_num_instants==3,
tint_start_value==1, tint_out renders the values, tnumber_integral is finite —
the safety gate proving the generated surface binds and executes before the hand
UDF layers (MobilityDB#22/MobilityDB#24/MobilityDB#25/MobilityDB#26) are retired. Adds junit-jupiter + surefire (fork per
class, JDK17 --add-opens) and bumps jnr-ffi to 2.2.17.

NOTE: these Spark-integration tests require JDK 17 — Spark 3.4 cannot init on
JDK 21 (DirectByteBuffer.<init>(long,int) removed) and the Java-17 JMEOS jar
cannot load on JDK 11. CI must run the Spark build/test on JDK 17.
The generator skipped every *_in WKT parser (tint_in/tfloat_in/tgeompoint_in/
geo_from_text-style) because 'char' was deferred in INTERNAL, so the generated
surface could operate on hex but not PARSE literals. jnr already marshals a single
const char* as a Java String, so map it through directly: arg_kind now emits a
StringType pass-through for 'char *'. Coverage 50% -> 53% (+98 UDFs incl. the
parsers). GeneratedSurfaceTest adds a full WKT parse->operate round-trip driven
only by generated UDFs (tint_in -> num_instants==3, tint_out renders) — 4/4 green
on JDK17. Also gitignore the default src/ generate location (build-time gen writes
to target/; a manual run with the default --out would otherwise duplicate classes).
Extends GeneratedSurfaceTest with the extended-type families (tcbuffer 2 instants
+ tcbuffer_radius non-null; tnpoint 2 instants) — widening the runtime safety net
across families before the hand UDF layers are retired. 5/5 green on JDK17.
(H3 is excluded: the installed libmeos is built H3-OFF, so th3index symbols are
absent at runtime; covered once an H3-built libmeos is available.)
- Map uint64_t <-> Java long (jnr) -> Spark LongType for both args and returns.
  Emits +61 1:1 UDFs (53%->54%), mostly the H3Index family; binding-verified by
  the jar arity/kind cross-check + compile (runtime-covered once libmeos is built
  -DH3=ON). Left 'long' UNMAPPED: the catalog uses it ambiguously for things the jar
  treats as Pointer/OffsetDateTime, which a blanket long->LongType would miscompile.
- Default --out to target/generated-sources/spark (the maven build dir), NOT the
  src/ source root: a bare 'python3 codegen' run could otherwise write generated
  classes into src/ that linger past 'mvn clean' (which only cleans target/) and
  duplicate-compile. Build-time generation owns target/.
Extend the generator's dispatch pass beyond the comparison families to the
whole portable operator contract: topology (overlaps/contains/contained/
adjacent), same, time position (before/after/overbefore/overafter), and
space Y/Z all emit once over a MEOS superclass entrypoint
(*_temporal_temporal / *_tspatial_tspatial), and distance maps tdistance ->
tdistance_tgeo_tgeo, nearestApproachDistance -> nad_tgeo_geo. Space X
(left/right/overleft/overright) is the only axis-ambiguous family: a thin
UdfMarshal.axisBool classifier inspects whether arg1 is a tnumber and
selects the value-axis vs the X-axis backing -- the operator's own existing
MEOS symbols, no operator logic. 41/41 contract bare names now generated,
reproducing the complete hand PortableOperatorAliasUDFs surface.

Remove the dead hand-written MeosNative jnr binding (444 lines): the
generated UDFs and the MeosMemory/MeosThread substrate call
functions.GeneratedFunctions from the org.jmeos:meos jar, and nothing
references MeosNative -- it is leftover hand binding superseded by JMEOS.

GeneratedSurfaceTest: add portable_bare_name_dispatch_surface exercising one
operator per family (overlaps/same/overbefore/tempEq/everEq/overleft/
tdistance); 6/6 green on JDK17.
The *_as_hexwkb / *_as_ewkb serializers return char* with a trailing
size_t *size_out out-param that JMEOS swallows (it returns the buffer/String
directly), and take an `unsigned char variant` flag that JMEOS maps to byte.
Both were generator exclusions: classify() now drops a trailing non-const
size_t* (the canonical buffer-length out-param), and unsigned char maps to
ByteType (removed from the INTERNAL skip set). +4 1:1 UDFs -- temporal/span/
set/spanset _as_hexwkb -- 2398 -> 2402. This closes the last canonical-name
gap behind the BerlinMOD bench's asHexWKB usage (the other five bench names
already map: atTime->temporal_at_tstzspan, eDwithin->edwithin_tgeo_tgeo,
eIntersects->eintersects_tgeo_tgeo, trajectory->tpoint_trajectory,
nearestApproachDistance is a portable bare name).

GeneratedSurfaceTest: add as_hexwkb_family_with_swallowed_size_out_param
(a hex round-trip through temporal_as_hexwkb). 7/7 green on JDK17.
…spatch

Every catalog function carries the canonical MobilityDB SQL name in its @sqlfn
tag (numInstants, eIntersects, atTime, asHexWKB, nearestApproachDistance ...)
— the name users and the portable BerlinMOD suite actually call. Emit each UDF
under its @sqlfn name; where one @sqlfn maps several C overloads that share a
marshalled signature (eIntersects <- eintersects_tgeo_tgeo / _tgeo_geo /
_geo_tgeo), emit ONE UDF that dispatches to the matching backing by parsing each
hex-WKB / WKT arg at runtime (parse-all-then-match, leak-free on every path).

325 @sqlfn names (80 arg-kind-dispatched); 1:1 surface 2402 -> 2725 (62%). The
whole pass is data-driven from the catalog — zero hand cases — and made safe by:
- isHex guard: MEOS's hex decoder CRASHES (segfaults) on non-hex input, so a
  *_from_hexwkb parser is only tried when the String is hex; a WKT literal falls
  through to the geo_from_text candidate instead of crashing the JVM.
- subtype dedup: overloads differing only by temporal subtype (tdistance_tgeo_tgeo
  vs _tnpoint_tnpoint) parse identically and can't be told apart — keep one per
  parse-kind tuple, preferring the tgeo/geo family the suite uses; siblings stay
  reachable under their C name.
- parser-safety filter: a dispatcher is emitted only when every overload
  discriminates via hex-WKB / WKT; text *_in overloads (stbox/tbox/cbuffer/npoint/
  pose) are dropped from dispatch (so nearestApproachDistance keeps just tgeo_tgeo
  / tgeo_geo), 12 wholly-unsafe names left to their C names (logged, not guessed).
- ever/always predicate convention: <e|a><Verb> @sqlfn returning int in C is
  boolean in SQL -> marshal to Boolean (== 1).

Drop the now-redundant portable distance pass (tdistance / nearestApproachDistance):
its single tgeo_geo backing wrongly nulled a trip-vs-trip call; the @sqlfn pass
supersedes it with correct arg-kind dispatch.

GeneratedSurfaceTest: add sqlfn_canonical_names_with_argkind_dispatch (eIntersects
tgeo/tgeo + tgeo/WKT, eDwithin 3-arg, nearestApproachDistance tgeo/tgeo, trajectory,
numInstants). 8/8 green on JDK17.
Regenerated meos-idl.json from MobilityDB ecosystem pin 14h (80ddc3d6c) with
the consolidated MEOS-API extractor (sqlfn-name-map + camelCase comparison
families + jsonb/jsonpath recovery + doxygen groups + the new SQL-arity pass).
The stale vendored catalog predated several upstream fixes; this bump flows
them into the generated Spark surface:

- eintersects_tgeo_geo now maps to @sqlfn eIntersects (was the copy-pasted
  aIntersects, MobilityDB #1200) -> eIntersects(tgeo, geometry) is restored,
  and the combined ea_* impls are tagged meos_internal (#1206) so eIntersects
  registers as the correct 2-arg UDF instead of leaking the 3-arg ea_ form.
- @InGroup doxygroups present -> the 618 genuinely-internal functions are
  excluded again (coverage reads an honest 61%, not the inflated 66% from a
  groupless catalog that leaked internals).
- sqlArity / sqlArityMax attached (MEOS-API#1) for the eventual SQL-faithful
  arity pass (trajectory/asHexWKB flag + out-param args).

GeneratedSurfaceTest 8/8 green on JDK17; eIntersects/eDwithin/nad/tempEq all
register and dispatch correctly against the refreshed surface.
The branch had no CI and a Spark 3.4 / Java 17 pom that diverged from the
repository's Java 21 / Spark 3.5 matrix (Spark 3.4 cannot initialise on
Java 21). Bump the pom to Spark 3.5.1 + Java 21, and add a Maven CI workflow
(Linux) that builds libmeos from ecosystem pin 14h (the same commit the
vendored catalog and the bundled JMEOS jar are generated against), installs
the JMEOS jar as org.jmeos:meos:1.0, then runs the catalog generator and the
GeneratedSurfaceTest. 8/8 green locally on Java 21 / Spark 3.5.1.

Bundles libs/JMEOS.jar (the generator reads its symbols at build time and the
UDFs call it at runtime; not on Maven Central).
Retire the hand-written UDF layers: the bench now consumes the catalog-
GENERATED surface (build-time codegen from the MEOS-API catalog at pin 14h)
via MobilitySparkSession.create -> GeneratedSpatioTemporalUDFs.registerAll,
dropping all the hand UDF classes, the committed generated/*.java snapshot,
and the org.mobiltydb scaffolding.

Single canonical source, no tool-local copies: queries come ONLY from the
canonical berlinmod/suite submodule (berlinmod-portability, shared by every
engine) run by BerlinMODBench; data from the shared generator
setup/generate_data.sh (-> MobilityDB-BerlinMOD). Removed BerlinMODDemo, which
duplicated both the queries (inline spark.sql) and the corpus (loadSynthetic);
its canonical CSV loader moved into BerlinMODBench.

Build on Java 21 / Spark 3.5 with the Linux Maven CI (libmeos from pin 14h,
the bundled JMEOS jar, the generator, GeneratedSurfaceTest) plus the shade fat
jar for spark-submit. 8/8 green; mvn package emits ...-spark.jar.
Picks up MobilityDB#28's type-safe UdfMarshal.*FromHex parsers (no SIGSEGV on a foreign
hex-WKB family) + topology span/temporal dispatch + sqlArity-faithful arity, so
the canonical BerlinMOD suite runs without the overlaps crash and asHexWKB/
trajectory resolve 1-arg.
@estebanzimanyi estebanzimanyi force-pushed the feat/berlinmod-benchmark branch from 941cd49 to 2b28850 Compare June 14, 2026 12:56
14i folds the WKB-reader type validation (#1212) that fixes the SIGSEGV on a
wrong-family hex-WKB buffer (the overlaps(span,span) / asHexWKB-dispatch crash
the BerlinMOD probe surfaced) at the MEOS source. The vendored catalog is
unchanged (14h == 14i catalog-wise; #1211/#1212 are .c-only), so only the
runtime libmeos moves.
atTime/minusTime now emit one String-classifying UDF (Spark can't overload), and
timestamp/TZ resolution routes through MEOS pg_timestamptz_in (ecosystem-uniform),
so the canonical BerlinMOD q03/q04 atTime(trip, instant|period) calls resolve.
…he tags)

BerlinMODBench.loadFromCsv now loads the raw CSVs into the *Input views and runs
the CANONICAL berlinmod/suite/load.sql to build the trip_h3 / geom_h3 prefilter
columns via th3index(trip,7) / geoToH3IndexSet(geom,7) — the single shared source
every engine runs, no Spark-local H3 logic (the only adaptation is the DDL form,
CREATE TABLE -> CREATE OR REPLACE TEMP VIEW). instant/period are kept as RAW
Strings so the atTime UDF parses them through MEOS (pg_timestamptz_in / tstzspan_in).

PENDING three upstream items before the NxN/prefilter queries run end-to-end:
  1. MEOS-C @csqlfn tags on tgeompoint_to_th3index / tgeogpoint_to_th3index /
     geo_to_h3index_set so the catalog exposes th3index / geoToH3IndexSet (relayed);
  2. the canonical load.sql adds the geomWKT alias q11/q12/q15 reference;
  3. the sample CSVs regenerated as EPSG:4326 (H3 raises on a non-latlong SRID).
Builds + GeneratedSurfaceTest 8/8 green (the unit test path does not run load.sql).
…tead

The corpus is owned by the canonical BerlinMOD repository (MobilityDB-BerlinMOD,
berlinmod_portability_export); a per-tool committed copy of the CSVs is the
irregularity to wipe out — and this one had drifted (mixed/projected SRID, not
the EPSG:4326 the th3index/geoToH3IndexSet prefilter requires).

- Remove berlinmod/data/*.csv (7 files) and gitignore the directory: the corpus
  is produced by the canonical setup/generate_data.sh -> berlinmod_portability_export,
  never committed.
- generate_data.sh exports at the canonical EPSG:4326 default (was a local 3812
  override); load.sql then builds the H3 columns, and nearestApproachDistance on
  the 4326 geodetic trips returns true metres.

The bench consumes the canonical loader (berlinmod/suite/load.sql) with the
canonical data (MobilityDB-BerlinMOD) via its dataDir arg — one data source, one
loader, zero copies.
…enerated

Mirror the binding cleanup: the bench registers only the generated
GeneratedSpatioTemporalUDFs.registerAll — not a single hand-registered UDF.
The MeosThread.wrap() helpers had no callers; remove them and the unused
spark.sql.api.java import, keeping the per-thread MEOS-init guard.
Re-vendor tools/meos-idl.json from the MEOS-API catalog regenerated against
ecosystem-pin-2026-06-14l (de8b322483), drop in the matching 14l JMEOS jar
(4466 methods, 2-arg count accessors), and bump the CI pin. GeneratedSurfaceTest
is 8/8 green against libmeos 14l.
Propagate the generator's array-in/SETOF support (the *_tgeoarr_tgeoarr template:
minDistance array overload + eDwithin/tDwithin/aDisjointPairs as array<struct<i,j
[,periods]>>) and re-vendor the 14m catalog. Bump the CI pin to ecosystem-pin-2026-06-14m.
The generated bench surface now carries th3index + the NxN array UDFs the updated
BerlinMOD suite (berlinmod-portability #3/#4/#5) needs to run index-less. 8/8 green.
…6/q10)

The canonical q05/q06/q10 use SQL that Spark cannot express: an ordered aggregate
array_agg(x ORDER BY k) (Spark has no ordered aggregate — PARSE_SYNTAX_ERROR) and the
SETOF table-function form f(...) AS p(i,j). BerlinMODBench now prefers a documented
berlinmod/spark-dialect/<q>.sql override when present, else the canonical file. The
overrides call the SAME generated MEOS UDFs (minDistance / eDwithinPairs / tDwithinPairs)
and return the SAME result — collect_list + array_sort + transform to parallel arrays,
then LATERAL VIEW explode of the array<struct>. Only the SQL glue differs, because the
engines genuinely diverge on these constructs.
…st literal)

Two fixes proven by running q05/q06/q10 end-to-end on real BerlinMOD data (500 trips):
- The generated `transform` geo UDF shadows Spark's `transform` higher-order function,
  so the q06/q10 dialect forms must extract parallel arrays via Spark's array-of-struct
  field access `rows.trip` (not `transform(rows, x -> x.trip)`).
- Propagate the generator fix coercing the NxN dist arg via Number, so the bare `10.0`
  / `3.0` literals in the dialect queries work (Spark parses them as decimal).
Result: q05 minDistance 820 pairs, q06 eDwithinPairs 68 truck-pairs, q10 tDwithinPairs
7258 rows — through the generated UDFs against libmeos 14m.
Re-vendor the 15a catalog (operator dialect: eEq/aEq/tEq, overlaps/tDistance/tAdd/
setUnion/tConcat) and propagate the data-driven GeneratedSurfaceTest whose dispatch
assertions read the bare names from the catalog's byOperator (op("?=") etc.) rather than
hard-coding them — so the dialect rename doesn't break the test. 9/9 green vs libmeos 15a.
Re-vendor the 15c catalog + the matching JMEOS jar (4469: + the pointcloud
eintersects_tpcpoint_geo / nad_tpcpoint_geo). Dialect unchanged from 15a; 9/9 green
vs libmeos 15c.
The 15d catalog's geoToH3IndexSet now extracts (empty-parens @sqlfn), so the th3-prefilter
load.sql resolves on Spark. JMEOS jar reused (C surface unchanged). 9/9 green vs libmeos 15d.
15d's pinned MEOS tree failed to compile (npoint_test.c concatenated/duplicate main);
15e restores it, full tree builds clean. Catalog byte-identical to 15d; pin tag bump
only. Green vs libmeos 15e.
…overload, hex-reject guard

Vendored-generator sync with MobilitySpark MobilityDB#28: UdfMarshal.geoFromText (EWKT SRID-preserving,
hex-rejecting) for geo args + the eEq/eNe comparison dispatch now folds in the non-temporal
overloads (eEq(h3indexset, th3index) — the th3 cell-set prefilter). The prefilter q02/q04/
q11–q17 build geoToH3IndexSet/th3index/eEq and run end-to-end. 9/9 green vs libmeos 15e.
…oH3Cell)

15f restores MEOS_TLS thread-safety → the th3 prefilter bench runs clean under Spark
local[*] (q02 15.5s, no crash, vs SIGSEGV on 15e). geoToH3Cell now emits (#1204). Catalog
4457 fns; JMEOS jar reused (superset). 9/9 green vs libmeos 15f.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant