Spark-only BerlinMOD benchmark harness consuming the canonical suite#23
Open
estebanzimanyi wants to merge 31 commits into
Open
Spark-only BerlinMOD benchmark harness consuming the canonical suite#23estebanzimanyi wants to merge 31 commits into
estebanzimanyi wants to merge 31 commits into
Conversation
b0bc798 to
4d7a5e8
Compare
This was referenced Jun 12, 2026
09962c2 to
3c2223c
Compare
estebanzimanyi
added a commit
to estebanzimanyi/MobilitySpark
that referenced
this pull request
Jun 12, 2026
Folds the MobilityDB#23 canonical-suite update into the integration/benchmark evidence so evidence == the deliverable stack (canonical q*.sql, eContains/round, no Spark-local SQL variants or preprocessing).
tools/codegen_spark_udfs.py emits MobilitySpark UDF-registration classes from the MEOS-API catalog (output/meos-idl.json), resolving each SQL name to its MEOS-C backing via the @sqlfn / @sqlop map (MEOS-API MobilityDB#18). Two modes: - SINGLE: one backing -> a 1:1 UDF (type-marshalling: each MEOS C type <-> its parse-from-String / serialize-to-String form). - DISPATCH: an overloaded SQL name / operator (overlaps via &&, stbox(geom,time), timeSpan) -> ONE UDF that classifies each arg by its MEOS type and routes to the catalog-determined backing. Classification is MEOS-driven and wire-format-safe: spans/stboxes/geometries travel as TEXT, only temporals as hex, so the leading token disambiguates ('['/'(' span, STBOX stbox, hex temporal, else geometry) and temporal_from_hexwkb is never fed a non-temporal. Emitted lambdas call only static GeneratedFunctions (no captured state -> Spark-serializable). Zero hand heuristics, zero new MEOS functions.
ecdc2ae to
9109d58
Compare
…roup Generalize the generator over the whole JMEOS public surface (was a 4-UDF POC): mirror JMEOS FunctionsGenerator's marshalling conventions — temporals / spans / sets / boxes / jsonb as hex-WKB or type text, TimestampTz as OffsetDateTime, DateADT as int, and bool f(.., result) out-params dropped with their value returned. Cross-check every emission against the JMEOS jar signatures (arity + return kind) so a collapsed catalog type can never miscompile. Organize the emitted UDFs into one class per doxygen @InGroup module — the reference-manual structure, so a function is found in the same place across tools — excluding meos_internal_*, and splitting oversized groups to stay under the JVM class limits. Emits ~2200 1:1 UDFs, compiling green.
Add a dispatch pass: each portable comparison bare name (everEq/everNe/everLt/ everLe/everGt/everGe, alwaysEq.., tempEq.. — RFC #920 / contract families everComparison/alwaysComparison/temporalComparison) is emitted ONCE wrapping its MEOS superclass entrypoint (ever_<op>_temporal_temporal / always_<op>_temporal_ temporal / temporal_<op>), which dispatches every concrete temporal type internally from the type-erased hex-WKB string — so Spark needs no per-type overload and no Java type-inspection. 18 bare names, emitted into GeneratedUdfs_portable_comparison, compiling green.
Three fixes so the catalog-generated UDF surface compiles against the pin-12l (4408) JMEOS jar: - Restore the runtime substrate MeosMemory/MeosThread/MeosNative (memory + native-init helpers the generated UDFs depend on). These are infrastructure, not hand-written UDF surface, and were dropped with the hand layers. - Drop the legacy org.mobiltydb typo-package placeholders (Main/PowerUDF/UDT) that imported the long-gone jmeos.functions package. - Fix the generator (tools/codegen_spark_udfs.py) to CLEAN its output dir before emitting: a prior run's classes (a function later excluded by the jar arity/kind cross-check, or a now-empty/renamed group) otherwise linger and silently break the build.
The whole Spark UDF surface is now GENERATED from the catalog at build time (North Star: bindings generated from MEOS, none hand-written or committed): - exec-maven-plugin runs tools/codegen_spark_udfs.py at generate-sources, reading the vendored 4408 catalog (tools/meos-idl.json) + the org.jmeos:meos jar's actual symbols, emitting to target/generated-sources/spark. - build-helper-maven-plugin adds that as a source root. - target/ is gitignored; no generated .java is committed. 'mvn clean compile' produces 195 group classes (2216 1:1 UDFs) and BUILDs green. NOTE: the org.jmeos:meos:1.0 jar must be in the local repo (mvn install:install-file from the JMEOS build); CI installs it before the Spark build.
GeneratedSurfaceTest registers the catalog-generated UDFs (GeneratedSpatioTemporalUDFs.registerAll) in a real SparkSession and asserts results across families via spark.sql against libmeos: temporal_num_instants==3, tint_start_value==1, tint_out renders the values, tnumber_integral is finite — the safety gate proving the generated surface binds and executes before the hand UDF layers (MobilityDB#22/MobilityDB#24/MobilityDB#25/MobilityDB#26) are retired. Adds junit-jupiter + surefire (fork per class, JDK17 --add-opens) and bumps jnr-ffi to 2.2.17. NOTE: these Spark-integration tests require JDK 17 — Spark 3.4 cannot init on JDK 21 (DirectByteBuffer.<init>(long,int) removed) and the Java-17 JMEOS jar cannot load on JDK 11. CI must run the Spark build/test on JDK 17.
The generator skipped every *_in WKT parser (tint_in/tfloat_in/tgeompoint_in/ geo_from_text-style) because 'char' was deferred in INTERNAL, so the generated surface could operate on hex but not PARSE literals. jnr already marshals a single const char* as a Java String, so map it through directly: arg_kind now emits a StringType pass-through for 'char *'. Coverage 50% -> 53% (+98 UDFs incl. the parsers). GeneratedSurfaceTest adds a full WKT parse->operate round-trip driven only by generated UDFs (tint_in -> num_instants==3, tint_out renders) — 4/4 green on JDK17. Also gitignore the default src/ generate location (build-time gen writes to target/; a manual run with the default --out would otherwise duplicate classes).
Extends GeneratedSurfaceTest with the extended-type families (tcbuffer 2 instants + tcbuffer_radius non-null; tnpoint 2 instants) — widening the runtime safety net across families before the hand UDF layers are retired. 5/5 green on JDK17. (H3 is excluded: the installed libmeos is built H3-OFF, so th3index symbols are absent at runtime; covered once an H3-built libmeos is available.)
- Map uint64_t <-> Java long (jnr) -> Spark LongType for both args and returns. Emits +61 1:1 UDFs (53%->54%), mostly the H3Index family; binding-verified by the jar arity/kind cross-check + compile (runtime-covered once libmeos is built -DH3=ON). Left 'long' UNMAPPED: the catalog uses it ambiguously for things the jar treats as Pointer/OffsetDateTime, which a blanket long->LongType would miscompile. - Default --out to target/generated-sources/spark (the maven build dir), NOT the src/ source root: a bare 'python3 codegen' run could otherwise write generated classes into src/ that linger past 'mvn clean' (which only cleans target/) and duplicate-compile. Build-time generation owns target/.
Extend the generator's dispatch pass beyond the comparison families to the whole portable operator contract: topology (overlaps/contains/contained/ adjacent), same, time position (before/after/overbefore/overafter), and space Y/Z all emit once over a MEOS superclass entrypoint (*_temporal_temporal / *_tspatial_tspatial), and distance maps tdistance -> tdistance_tgeo_tgeo, nearestApproachDistance -> nad_tgeo_geo. Space X (left/right/overleft/overright) is the only axis-ambiguous family: a thin UdfMarshal.axisBool classifier inspects whether arg1 is a tnumber and selects the value-axis vs the X-axis backing -- the operator's own existing MEOS symbols, no operator logic. 41/41 contract bare names now generated, reproducing the complete hand PortableOperatorAliasUDFs surface. Remove the dead hand-written MeosNative jnr binding (444 lines): the generated UDFs and the MeosMemory/MeosThread substrate call functions.GeneratedFunctions from the org.jmeos:meos jar, and nothing references MeosNative -- it is leftover hand binding superseded by JMEOS. GeneratedSurfaceTest: add portable_bare_name_dispatch_surface exercising one operator per family (overlaps/same/overbefore/tempEq/everEq/overleft/ tdistance); 6/6 green on JDK17.
The *_as_hexwkb / *_as_ewkb serializers return char* with a trailing size_t *size_out out-param that JMEOS swallows (it returns the buffer/String directly), and take an `unsigned char variant` flag that JMEOS maps to byte. Both were generator exclusions: classify() now drops a trailing non-const size_t* (the canonical buffer-length out-param), and unsigned char maps to ByteType (removed from the INTERNAL skip set). +4 1:1 UDFs -- temporal/span/ set/spanset _as_hexwkb -- 2398 -> 2402. This closes the last canonical-name gap behind the BerlinMOD bench's asHexWKB usage (the other five bench names already map: atTime->temporal_at_tstzspan, eDwithin->edwithin_tgeo_tgeo, eIntersects->eintersects_tgeo_tgeo, trajectory->tpoint_trajectory, nearestApproachDistance is a portable bare name). GeneratedSurfaceTest: add as_hexwkb_family_with_swallowed_size_out_param (a hex round-trip through temporal_as_hexwkb). 7/7 green on JDK17.
…spatch Every catalog function carries the canonical MobilityDB SQL name in its @sqlfn tag (numInstants, eIntersects, atTime, asHexWKB, nearestApproachDistance ...) — the name users and the portable BerlinMOD suite actually call. Emit each UDF under its @sqlfn name; where one @sqlfn maps several C overloads that share a marshalled signature (eIntersects <- eintersects_tgeo_tgeo / _tgeo_geo / _geo_tgeo), emit ONE UDF that dispatches to the matching backing by parsing each hex-WKB / WKT arg at runtime (parse-all-then-match, leak-free on every path). 325 @sqlfn names (80 arg-kind-dispatched); 1:1 surface 2402 -> 2725 (62%). The whole pass is data-driven from the catalog — zero hand cases — and made safe by: - isHex guard: MEOS's hex decoder CRASHES (segfaults) on non-hex input, so a *_from_hexwkb parser is only tried when the String is hex; a WKT literal falls through to the geo_from_text candidate instead of crashing the JVM. - subtype dedup: overloads differing only by temporal subtype (tdistance_tgeo_tgeo vs _tnpoint_tnpoint) parse identically and can't be told apart — keep one per parse-kind tuple, preferring the tgeo/geo family the suite uses; siblings stay reachable under their C name. - parser-safety filter: a dispatcher is emitted only when every overload discriminates via hex-WKB / WKT; text *_in overloads (stbox/tbox/cbuffer/npoint/ pose) are dropped from dispatch (so nearestApproachDistance keeps just tgeo_tgeo / tgeo_geo), 12 wholly-unsafe names left to their C names (logged, not guessed). - ever/always predicate convention: <e|a><Verb> @sqlfn returning int in C is boolean in SQL -> marshal to Boolean (== 1). Drop the now-redundant portable distance pass (tdistance / nearestApproachDistance): its single tgeo_geo backing wrongly nulled a trip-vs-trip call; the @sqlfn pass supersedes it with correct arg-kind dispatch. GeneratedSurfaceTest: add sqlfn_canonical_names_with_argkind_dispatch (eIntersects tgeo/tgeo + tgeo/WKT, eDwithin 3-arg, nearestApproachDistance tgeo/tgeo, trajectory, numInstants). 8/8 green on JDK17.
Regenerated meos-idl.json from MobilityDB ecosystem pin 14h (80ddc3d6c) with the consolidated MEOS-API extractor (sqlfn-name-map + camelCase comparison families + jsonb/jsonpath recovery + doxygen groups + the new SQL-arity pass). The stale vendored catalog predated several upstream fixes; this bump flows them into the generated Spark surface: - eintersects_tgeo_geo now maps to @sqlfn eIntersects (was the copy-pasted aIntersects, MobilityDB #1200) -> eIntersects(tgeo, geometry) is restored, and the combined ea_* impls are tagged meos_internal (#1206) so eIntersects registers as the correct 2-arg UDF instead of leaking the 3-arg ea_ form. - @InGroup doxygroups present -> the 618 genuinely-internal functions are excluded again (coverage reads an honest 61%, not the inflated 66% from a groupless catalog that leaked internals). - sqlArity / sqlArityMax attached (MEOS-API#1) for the eventual SQL-faithful arity pass (trajectory/asHexWKB flag + out-param args). GeneratedSurfaceTest 8/8 green on JDK17; eIntersects/eDwithin/nad/tempEq all register and dispatch correctly against the refreshed surface.
The branch had no CI and a Spark 3.4 / Java 17 pom that diverged from the repository's Java 21 / Spark 3.5 matrix (Spark 3.4 cannot initialise on Java 21). Bump the pom to Spark 3.5.1 + Java 21, and add a Maven CI workflow (Linux) that builds libmeos from ecosystem pin 14h (the same commit the vendored catalog and the bundled JMEOS jar are generated against), installs the JMEOS jar as org.jmeos:meos:1.0, then runs the catalog generator and the GeneratedSurfaceTest. 8/8 green locally on Java 21 / Spark 3.5.1. Bundles libs/JMEOS.jar (the generator reads its symbols at build time and the UDFs call it at runtime; not on Maven Central).
Retire the hand-written UDF layers: the bench now consumes the catalog- GENERATED surface (build-time codegen from the MEOS-API catalog at pin 14h) via MobilitySparkSession.create -> GeneratedSpatioTemporalUDFs.registerAll, dropping all the hand UDF classes, the committed generated/*.java snapshot, and the org.mobiltydb scaffolding. Single canonical source, no tool-local copies: queries come ONLY from the canonical berlinmod/suite submodule (berlinmod-portability, shared by every engine) run by BerlinMODBench; data from the shared generator setup/generate_data.sh (-> MobilityDB-BerlinMOD). Removed BerlinMODDemo, which duplicated both the queries (inline spark.sql) and the corpus (loadSynthetic); its canonical CSV loader moved into BerlinMODBench. Build on Java 21 / Spark 3.5 with the Linux Maven CI (libmeos from pin 14h, the bundled JMEOS jar, the generator, GeneratedSurfaceTest) plus the shade fat jar for spark-submit. 8/8 green; mvn package emits ...-spark.jar.
0f42d1d to
fe488cc
Compare
This was referenced Jun 14, 2026
Picks up MobilityDB#28's type-safe UdfMarshal.*FromHex parsers (no SIGSEGV on a foreign hex-WKB family) + topology span/temporal dispatch + sqlArity-faithful arity, so the canonical BerlinMOD suite runs without the overlaps crash and asHexWKB/ trajectory resolve 1-arg.
941cd49 to
2b28850
Compare
14i folds the WKB-reader type validation (#1212) that fixes the SIGSEGV on a wrong-family hex-WKB buffer (the overlaps(span,span) / asHexWKB-dispatch crash the BerlinMOD probe surfaced) at the MEOS source. The vendored catalog is unchanged (14h == 14i catalog-wise; #1211/#1212 are .c-only), so only the runtime libmeos moves.
atTime/minusTime now emit one String-classifying UDF (Spark can't overload), and timestamp/TZ resolution routes through MEOS pg_timestamptz_in (ecosystem-uniform), so the canonical BerlinMOD q03/q04 atTime(trip, instant|period) calls resolve.
…he tags)
BerlinMODBench.loadFromCsv now loads the raw CSVs into the *Input views and runs
the CANONICAL berlinmod/suite/load.sql to build the trip_h3 / geom_h3 prefilter
columns via th3index(trip,7) / geoToH3IndexSet(geom,7) — the single shared source
every engine runs, no Spark-local H3 logic (the only adaptation is the DDL form,
CREATE TABLE -> CREATE OR REPLACE TEMP VIEW). instant/period are kept as RAW
Strings so the atTime UDF parses them through MEOS (pg_timestamptz_in / tstzspan_in).
PENDING three upstream items before the NxN/prefilter queries run end-to-end:
1. MEOS-C @csqlfn tags on tgeompoint_to_th3index / tgeogpoint_to_th3index /
geo_to_h3index_set so the catalog exposes th3index / geoToH3IndexSet (relayed);
2. the canonical load.sql adds the geomWKT alias q11/q12/q15 reference;
3. the sample CSVs regenerated as EPSG:4326 (H3 raises on a non-latlong SRID).
Builds + GeneratedSurfaceTest 8/8 green (the unit test path does not run load.sql).
…tead The corpus is owned by the canonical BerlinMOD repository (MobilityDB-BerlinMOD, berlinmod_portability_export); a per-tool committed copy of the CSVs is the irregularity to wipe out — and this one had drifted (mixed/projected SRID, not the EPSG:4326 the th3index/geoToH3IndexSet prefilter requires). - Remove berlinmod/data/*.csv (7 files) and gitignore the directory: the corpus is produced by the canonical setup/generate_data.sh -> berlinmod_portability_export, never committed. - generate_data.sh exports at the canonical EPSG:4326 default (was a local 3812 override); load.sql then builds the H3 columns, and nearestApproachDistance on the 4326 geodetic trips returns true metres. The bench consumes the canonical loader (berlinmod/suite/load.sql) with the canonical data (MobilityDB-BerlinMOD) via its dataDir arg — one data source, one loader, zero copies.
…enerated Mirror the binding cleanup: the bench registers only the generated GeneratedSpatioTemporalUDFs.registerAll — not a single hand-registered UDF. The MeosThread.wrap() helpers had no callers; remove them and the unused spark.sql.api.java import, keeping the per-thread MEOS-init guard.
Re-vendor tools/meos-idl.json from the MEOS-API catalog regenerated against ecosystem-pin-2026-06-14l (de8b322483), drop in the matching 14l JMEOS jar (4466 methods, 2-arg count accessors), and bump the CI pin. GeneratedSurfaceTest is 8/8 green against libmeos 14l.
Propagate the generator's array-in/SETOF support (the *_tgeoarr_tgeoarr template: minDistance array overload + eDwithin/tDwithin/aDisjointPairs as array<struct<i,j [,periods]>>) and re-vendor the 14m catalog. Bump the CI pin to ecosystem-pin-2026-06-14m. The generated bench surface now carries th3index + the NxN array UDFs the updated BerlinMOD suite (berlinmod-portability #3/#4/#5) needs to run index-less. 8/8 green.
…6/q10) The canonical q05/q06/q10 use SQL that Spark cannot express: an ordered aggregate array_agg(x ORDER BY k) (Spark has no ordered aggregate — PARSE_SYNTAX_ERROR) and the SETOF table-function form f(...) AS p(i,j). BerlinMODBench now prefers a documented berlinmod/spark-dialect/<q>.sql override when present, else the canonical file. The overrides call the SAME generated MEOS UDFs (minDistance / eDwithinPairs / tDwithinPairs) and return the SAME result — collect_list + array_sort + transform to parallel arrays, then LATERAL VIEW explode of the array<struct>. Only the SQL glue differs, because the engines genuinely diverge on these constructs.
…st literal) Two fixes proven by running q05/q06/q10 end-to-end on real BerlinMOD data (500 trips): - The generated `transform` geo UDF shadows Spark's `transform` higher-order function, so the q06/q10 dialect forms must extract parallel arrays via Spark's array-of-struct field access `rows.trip` (not `transform(rows, x -> x.trip)`). - Propagate the generator fix coercing the NxN dist arg via Number, so the bare `10.0` / `3.0` literals in the dialect queries work (Spark parses them as decimal). Result: q05 minDistance 820 pairs, q06 eDwithinPairs 68 truck-pairs, q10 tDwithinPairs 7258 rows — through the generated UDFs against libmeos 14m.
Re-vendor the 15a catalog (operator dialect: eEq/aEq/tEq, overlaps/tDistance/tAdd/
setUnion/tConcat) and propagate the data-driven GeneratedSurfaceTest whose dispatch
assertions read the bare names from the catalog's byOperator (op("?=") etc.) rather than
hard-coding them — so the dialect rename doesn't break the test. 9/9 green vs libmeos 15a.
Re-vendor the 15c catalog + the matching JMEOS jar (4469: + the pointcloud eintersects_tpcpoint_geo / nad_tpcpoint_geo). Dialect unchanged from 15a; 9/9 green vs libmeos 15c.
The 15d catalog's geoToH3IndexSet now extracts (empty-parens @sqlfn), so the th3-prefilter load.sql resolves on Spark. JMEOS jar reused (C surface unchanged). 9/9 green vs libmeos 15d.
15d's pinned MEOS tree failed to compile (npoint_test.c concatenated/duplicate main); 15e restores it, full tree builds clean. Catalog byte-identical to 15d; pin tag bump only. Green vs libmeos 15e.
…overload, hex-reject guard Vendored-generator sync with MobilitySpark MobilityDB#28: UdfMarshal.geoFromText (EWKT SRID-preserving, hex-rejecting) for geo args + the eEq/eNe comparison dispatch now folds in the non-temporal overloads (eEq(h3indexset, th3index) — the th3 cell-set prefilter). The prefilter q02/q04/ q11–q17 build geoToH3IndexSet/th3index/eEq and run end-to-end. 9/9 green vs libmeos 15e.
…oH3Cell) 15f restores MEOS_TLS thread-safety → the th3 prefilter bench runs clean under Spark local[*] (q02 15.5s, no crash, vs SIGSEGV on 15e). geoToH3Cell now emits (#1204). Catalog 4457 fns; JMEOS jar reused (superset). 9/9 green vs libmeos 15f.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacked on #22 (the canonical UDF library). Review #22 first — this PR's own change is the bench commit on top; its diff cleans up to bench-only once #22 merges to
main.Adds the Q1–Q17 query set with reference expecteds, the loader/runner scripts, the
BerlinMODBench/BerlinMODDemodrivers, and the 3-tier index framework (Spark column-store prefilter · th3index cells · PG GiST/SP-GiST) with the NxN cross-join mitigations.Demo/harness/data only — no change to the library surface, so the unit suite is unchanged (907/907).