MobilityDB · estebanzimanyi · Jun 12, 2026 · Jun 13, 2026 · Jun 13, 2026 · Jun 14, 2026
diff --git a/.github/workflows/maven.yml b/.github/workflows/maven.yml
@@ -0,0 +1,67 @@
+name: Maven CI
+
+on:
+  push:
+    branches: ["**"]
+  pull_request:
+
+jobs:
+  # ── Linux ────────────────────────────────────────────────────────────────────
+  linux:
+    name: Build and test — Linux (Java 21 / Spark 3.5)
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Java 21
+        uses: actions/setup-java@v4
+        with:
+          distribution: temurin
+          java-version: "21"
+          cache: maven
+
+      - name: Install MEOS build dependencies
+        run: |
+          sudo apt-get update -qq
+          sudo apt-get install -y \
+            cmake ninja-build \
+            libjson-c-dev libgeos-dev libproj-dev libgsl-dev libh3-dev
+
+      - name: Checkout MobilityDB source (for MEOS build)
+        uses: actions/checkout@v4
+        with:
+          # Ecosystem pin: the SAME commit the vendored catalog (tools/meos-idl.json)
+          # and the bundled JMEOS jar are generated against.
+          repository: estebanzimanyi/MobilityDB
+          ref: ecosystem-pin-2026-06-15f
+          path: MobilityDB-src
+
+      - name: Build and install libmeos.so
+        run: |
+          # The build dir lives inside MobilityDB-src so the vendored pgtypes headers
+          # ("../../meos/include/...") resolve against the source tree.
+          cmake -S MobilityDB-src -B MobilityDB-src/meos-build \
+            -G Ninja \
+            -DCMAKE_BUILD_TYPE=Release \
+            -DMEOS=ON \
+            -DCBUFFER=ON -DNPOINT=ON -DPOSE=ON -DRGEO=ON \
+            -DH3=ON \
+            -DH3_LIBRARY=/usr/lib/x86_64-linux-gnu/libh3.so \
+            -DH3_INCLUDE_DIR=/usr/include/h3
+          cmake --build MobilityDB-src/meos-build -j
+          sudo cmake --install MobilityDB-src/meos-build
+          echo "LD_LIBRARY_PATH=/usr/local/lib" >> "$GITHUB_ENV"
+
+      - name: Install the bundled JMEOS jar into the local repo
+        # The generator reads this jar's symbols at build time and the UDFs call it at
+        # runtime (org.jmeos:meos:1.0); it is not on Maven Central.
+        run: |
+          mvn -B install:install-file \
+            -Dfile=libs/JMEOS.jar \
+            -DgroupId=org.jmeos -DartifactId=meos -Dversion=1.0 -Dpackaging=jar
+
+      - name: Build + generate + unit tests
+        # generate-sources runs the catalog-driven UDF generator; test exercises the
+        # generated surface against libmeos through JMEOS.
+        run: mvn -B clean test
diff --git a/.gitignore b/.gitignore
@@ -14,3 +14,6 @@
 # Maven
 log/
 target/
+tools/__pycache__/
+/target/
+src/main/java/org/mobilitydb/spark/generated/
diff --git a/.gitmodules b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "berlinmod/suite"]
+	path = berlinmod/suite
+	url = https://github.com/estebanzimanyi/berlinmod-portability.git
diff --git a/INTEGRATION_NOTES.md b/INTEGRATION_NOTES.md
@@ -0,0 +1,19 @@
+# INTEGRATION BRANCH NOTE — MEOS / JMEOS pin
+
+`integration/berlinmod-bench` builds against ecosystem pin
+**`ecosystem-pin-2026-06-11p`**.
+
+- **`libs/JMEOS-1.4.jar`** is the canonical JMEOS regen at that pin
+  (JMEOS PR #19): a single generated `functions.GeneratedFunctions`
+  surface. The legacy hand-rolled `functions.functions` facade is retired,
+  and every UDF binds the generated surface directly.
+- **`lib/libmeos.so`** is built from the pin with `-DH3=ON` and the
+  CBUFFER / NPOINT / POSE / RGEO families, so the th3index family is backed
+  with no build-time special-casing.
+- **CI** (`.github/workflows/maven.yml`) builds `libmeos` from the pin on
+  Linux and macOS (with H3). The Windows job is non-blocking until the
+  `MEOS_TZDATA_DIR` cmake option lands in the pin (it currently lives only on
+  the `meos-windows-bootstrap` branch); once it is merged, the Windows job will
+  build from the pin like the other platforms.
+
+The full unit suite is green (907/907).
diff --git a/README.md b/README.md
@@ -16,6 +16,14 @@ The MobilityDB project is developed by the Computer & Decision Engineering Depar
 
 More information about MobilityDB, including publications, presentations, etc., can be found in the MobilityDB [website](https://mobilitydb.com).
 
+### For contributors and reviewers
+
+- Reviewing a pull request?  See the
+  [PR Reviewer Guide](doc/contributing/reviewer-guide.md) — tier ranking,
+  dependency chains and the standards checklist.  Reviewers landing in
+  any of the three platform repos (MobilityDB / MobilityDuck /
+  MobilitySpark) find the same canonical structure at the same path.
+
 
 ## Table of Contents
 
@@ -40,7 +48,7 @@ More information about MobilityDB, including publications, presentations, etc.,
 
 - 🚀 MobilityDB installed with MEOS
 - 🔧 JMEOS working version
-- ⚡ Spark 3.4.0
+- ⚡ Apache Spark 3.5.x (LTS); see [`doc/spark-version.md`](doc/spark-version.md) for the Spark-version target and the rationale for not yet supporting Spark 4
 - 📝 Maven 4
 - ☕ Java 17 (recommended)
 

diff --git a/berlinmod/README.md b/berlinmod/README.md
@@ -0,0 +1,102 @@
+# BerlinMOD benchmark — MobilitySpark
+
+This directory benchmarks the **canonical portable BerlinMOD suite** on
+Apache Spark via MobilitySpark's MEOS-backed UDFs.
+
+The queries, schema, and load script are **not** kept here — they live in the
+single canonical source, vendored as the `suite/` git submodule
+([`berlinmod-portability`](https://github.com/estebanzimanyi/berlinmod-portability),
+shared byte-for-byte with MobilityDB and MobilityDuck). This directory holds
+only the Spark-side runner, the data corpus, and the expected results.
+
+```
+berlinmod/
+  suite/        # submodule — the ONE canonical SQL (q01..q17, qrt, schema.sql,
+                #             portable_aliases.sql, data/load.sql)
+  data/         # CSV corpus loaded into Spark
+  expected/     # expected query results for correctness checks
+  bench/        # bench_mspark.sh (timing), report.py, chart.py
+  run_mspark.sh # correctness demo runner
+```
+
+After cloning, initialise the submodule:
+
+```bash
+git submodule update --init berlinmod/suite
+```
+
+---
+
+## Running
+
+**Timing benchmark** (`BerlinMODBench` — reads every query from `suite/`):
+
+```bash
+berlinmod/bench/bench_mspark.sh --data berlinmod/data --output berlinmod/bench/results/mspark.json
+# --quick = 1 run/query, --queries q05,q10 to select, --runs N to repeat
+```
+
+The runner sets `-Dberlinmod.sql.dir=berlinmod/suite`, so every query is read
+from the canonical submodule — there is no Spark-specific query variant and no
+SQL rewriting.
+
+**Correctness demo runner** (`BerlinMODBench`, canonical `berlinmod/suite` queries + shared CSV data):
+
+```bash
+berlinmod/run_mspark.sh [spark-submit-binary]
+```
+
+---
+
+## Spatial joins on Spark
+
+Spark SQL has **no native spatial index**. The canonical queries pre-filter
+spatial joins with the bounding-box `&&` operator, which PostgreSQL (GiST) and
+DuckDB (TRTREE) accelerate with a spatial index. Spark cannot — it evaluates
+`&&` as an opaque UDF, so the index-less spatial joins **q10 / q11 / q12 / q14**
+degrade to a Cartesian product with a per-pair MEOS UDF (each call re-parses the
+trajectory from hex). On the full corpus these exceed any reasonable per-query
+budget; bound them by wrapping the runner call in `timeout`.
+
+The route to timings comparable to the indexed engines is the **th3index
+columnar prefilter**: explode each trip's th3index into one row per H3 cell and
+turn the overlap into an **equi-join on the cell column**, which Spark
+accelerates natively (hash join) — pruning candidate pairs before the exact
+`tDwithin`/`eDwithin` runs. th3index is portable (PG/DuckDB/Spark all compute it
+from the same MEOS function), so it is the Tier-1 acceleration all three share.
+
+---
+
+## Data
+
+`data/` holds the BerlinMOD corpus (`trips.csv` is git-ignored — it is large and
+regenerated). The CSVs are produced by
+[MobilityDB-BerlinMOD](https://github.com/MobilityDB/MobilityDB-BerlinMOD) via
+`berlinmod_portability_export()`:
+
+```sql
+-- In a PostgreSQL database with generated BerlinMOD data:
+\i BerlinMOD/berlinmod_export.sql
+-- args: output path, H3 resolution, output SRID (3812 = ETRS89 / Belgian Lambert 2008,
+-- the metric CRS covering Brussels). The export reprojects and SRID-tags everything to
+-- this SRID, so no consumer ever has to reproject.
+SELECT berlinmod_portability_export('/path/to/output/', 7, 3812);
+```
+
+This single call writes the complete dataset (`vehicles.csv`, `trips.csv` with
+hex-EWKB trips, and `query_{licences,instants,points,periods,regions}.csv`) in the
+schema defined by `suite/schema.sql`. Drop the files into `data/` and re-run; there
+is no per-tool post-processing.
+
+---
+
+## Cross-engine comparison
+
+Each tool benchmarks itself and emits a results JSON. The cross-engine
+comparison is assembled **offline** by merging the per-tool JSONs:
+
+```bash
+# Collect mspark.json (here) + mbdb.json / mduck.json (from those repos) into one dir, then:
+python3 bench/report.py --results <dir> --output <dir>/report.md
+python3 bench/chart.py  --results <dir> --output <dir>/chart.png --log   # --log: Spark spatial joins span orders of magnitude
+```
diff --git a/berlinmod/bench/.gitignore b/berlinmod/bench/.gitignore
@@ -0,0 +1,6 @@
+# Machine-specific benchmark results — do not commit
+results/*.json
+results/report.md
+
+# DuckDB scratch database
+/tmp/berlinmod_bench.duckdb
diff --git a/berlinmod/bench/README.md b/berlinmod/bench/README.md
@@ -0,0 +1,56 @@
+# MobilitySpark BerlinMOD benchmark runner
+
+Times the canonical BerlinMOD suite on Apache Spark. The queries come from the
+`berlinmod/suite/` submodule (the single canonical source); see
+[`../README.md`](../README.md) for context.
+
+---
+
+## Prerequisites
+
+```bash
+# Apache Spark 3.5.4 (~300 MB) + Maven
+sudo apt-get install maven
+bash ../../setup/install_spark.sh
+source ~/.bashrc
+
+# Build the fat JAR (once, or after code changes)
+cd ../.. && mvn package -DskipTests -q
+```
+
+## Run
+
+```bash
+# Full corpus, 3 runs/query, data from berlinmod/data/
+bash bench_mspark.sh --data ../data --runs 3 --output results/mspark.json
+
+# 1 run/query; select queries
+bash bench_mspark.sh --data ../data --quick --queries q05,q10
+```
+
+All queries run in a single `spark-submit` session (class `BerlinMODBench`);
+JVM startup is excluded from query timings. The runner reads every query from
+`berlinmod/suite/` — no Spark-specific variant. The index-less spatial joins
+(q10/q11/q12/q14) can exceed a per-query budget on the full corpus; wrap the
+call in `timeout` to bound them (see `../README.md`, "Spatial joins on Spark").
+
+## Cross-engine comparison (offline)
+
+Each tool emits its own `*.json`. To build a comparison table/chart, collect
+`mspark.json` here together with `mbdb.json` / `mduck.json` from those repos
+into one directory, then merge offline:
+
+```bash
+python3 report.py --results <dir> --output <dir>/report.md
+python3 chart.py  --results <dir> --output <dir>/chart.png --log
+```
+
+`chart.py` requires `matplotlib`. `--log` is advisable since Spark's spatial
+joins span several orders of magnitude.
+
+## Methodology
+
+- **Timing**: wall-clock milliseconds around each query, median of N runs; data
+  loading excluded.
+- **SQL**: the identical canonical `<query>.sql` from `berlinmod/suite/` —
+  named-function portable dialect, no operator symbols.