31 commits
b939673
Add the catalog-driven Spark UDF generator
estebanzimanyi Jun 12, 2026
4876e7d
Drive the full JMEOS surface from the catalog, organized by doxygen g…
estebanzimanyi Jun 13, 2026
de62e78
Generate the portable bare-name dispatch from the contract families
estebanzimanyi Jun 13, 2026
1604174
Make the generated-dispatch surface build cleanly
estebanzimanyi Jun 14, 2026
942f2ce
Wire build-time generation of the UDF surface (generate-sources)
estebanzimanyi Jun 14, 2026
6b0ddd7
Add runtime verification of the generated UDF surface
estebanzimanyi Jun 14, 2026
825b762
Generate the *_in parse functions (const char* -> Java String)
estebanzimanyi Jun 14, 2026
c3810cc
Broaden generated-surface verification to cbuffer + npoint families
estebanzimanyi Jun 14, 2026
ae456d8
Generate uint64_t functions + default codegen output to target/
estebanzimanyi Jun 14, 2026
dc2d2c5
Generate the full portable bare-name operator dispatch surface
estebanzimanyi Jun 14, 2026
45da8d5
Generate the *_as_hexwkb family (swallow size_out, map unsigned char)
estebanzimanyi Jun 14, 2026
dbb61d1
Generate the @sqlfn canonical MobilityDB SQL surface with overload di…
estebanzimanyi Jun 14, 2026
5fe096c
Re-vendor the catalog from the consolidated MEOS-API at pin 14h
estebanzimanyi Jun 14, 2026
a2e71f8
Run the generated-dispatch surface on Java 21 / Spark 3.5 under CI
estebanzimanyi Jun 14, 2026
fe488cc
Rebuild the BerlinMOD benchmark on the generated UDF surface
estebanzimanyi Jun 14, 2026
2b28850
Re-vendor the generator with the wrong-type-WKB crash + SQL-arity fixes
estebanzimanyi Jun 14, 2026
f87e5ee
Bump the CI libmeos pin to ecosystem-pin-2026-06-14i
estebanzimanyi Jun 14, 2026
4207128
Re-vendor the generator with the atTime time-restrict + ecosystem-TZ fix
estebanzimanyi Jun 14, 2026
5867635
Wire the bench loader to the canonical load.sql H3 build (ready for t…
estebanzimanyi Jun 14, 2026
f1a3579
Delete the committed BerlinMOD data copy; read the canonical data ins…
estebanzimanyi Jun 14, 2026
e8ae766
Drop the unused hand-registration helpers; the UDF surface is fully g…
estebanzimanyi Jun 14, 2026
42cf659
Advance to ecosystem-pin-2026-06-14l (catalog + JMEOS jar + CI pin)
estebanzimanyi Jun 14, 2026
4acfe4c
Advance to ecosystem-pin-2026-06-14m + generate the NxN array UDFs
estebanzimanyi Jun 15, 2026
1576233
Add a Spark-dialect overlay for the PG-only NxN array queries (q05/q0…
estebanzimanyi Jun 15, 2026
a8aa13f
Make the Spark-dialect NxN queries actually run (rows.field + bare di…
estebanzimanyi Jun 15, 2026
5e891c2
Advance to ecosystem-pin-2026-06-15a (canonical operator dialect)
estebanzimanyi Jun 15, 2026
d5ecfa8
Advance to ecosystem-pin-2026-06-15c (catalog + JMEOS jar + CI pin)
estebanzimanyi Jun 15, 2026
0d99c89
Advance to ecosystem-pin-2026-06-15d (h3 @sqlfn fix → geoToH3IndexSet)
estebanzimanyi Jun 15, 2026
b3f4a42
Advance pin to ecosystem-pin-2026-06-15e (npoint_test build fix)
estebanzimanyi Jun 15, 2026
51effeb
Fix the th3 spatial prefilter on Spark: EWKT/SRID geo parse, eEq set …
estebanzimanyi Jun 15, 2026
711ca81
Advance to ecosystem-pin-2026-06-15f (thread-safe geo_from_text; geoT…
estebanzimanyi Jun 15, 2026
67 changes: 67 additions & 0 deletions .github/workflows/maven.yml
@@ -0,0 +1,67 @@
name: Maven CI

on:
  push:
    branches: ["**"]
  pull_request:

jobs:
  # ── Linux ────────────────────────────────────────────────────────────────────
  linux:
    name: Build and test — Linux (Java 21 / Spark 3.5)
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Java 21
        uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: "21"
          cache: maven

      - name: Install MEOS build dependencies
        run: |
          sudo apt-get update -qq
          sudo apt-get install -y \
            cmake ninja-build \
            libjson-c-dev libgeos-dev libproj-dev libgsl-dev libh3-dev

      - name: Checkout MobilityDB source (for MEOS build)
        uses: actions/checkout@v4
        with:
          # Ecosystem pin: the SAME commit the vendored catalog (tools/meos-idl.json)
          # and the bundled JMEOS jar are generated against.
          repository: estebanzimanyi/MobilityDB
          ref: ecosystem-pin-2026-06-15f
          path: MobilityDB-src

      - name: Build and install libmeos.so
        run: |
          # The build dir lives inside MobilityDB-src so the vendored pgtypes headers
          # ("../../meos/include/...") resolve against the source tree.
          cmake -S MobilityDB-src -B MobilityDB-src/meos-build \
            -G Ninja \
            -DCMAKE_BUILD_TYPE=Release \
            -DMEOS=ON \
            -DCBUFFER=ON -DNPOINT=ON -DPOSE=ON -DRGEO=ON \
            -DH3=ON \
            -DH3_LIBRARY=/usr/lib/x86_64-linux-gnu/libh3.so \
            -DH3_INCLUDE_DIR=/usr/include/h3
          cmake --build MobilityDB-src/meos-build -j
          sudo cmake --install MobilityDB-src/meos-build
          echo "LD_LIBRARY_PATH=/usr/local/lib" >> "$GITHUB_ENV"

      - name: Install the bundled JMEOS jar into the local repo
        # The generator reads this jar's symbols at build time and the UDFs call it at
        # runtime (org.jmeos:meos:1.0); it is not on Maven Central.
        run: |
          mvn -B install:install-file \
            -Dfile=libs/JMEOS.jar \
            -DgroupId=org.jmeos -DartifactId=meos -Dversion=1.0 -Dpackaging=jar

      - name: Build + generate + unit tests
        # generate-sources runs the catalog-driven UDF generator; test exercises the
        # generated surface against libmeos through JMEOS.
        run: mvn -B clean test
3 changes: 3 additions & 0 deletions .gitignore
@@ -14,3 +14,6 @@
# Maven
log/
target/
tools/__pycache__/
/target/
src/main/java/org/mobilitydb/spark/generated/
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "berlinmod/suite"]
path = berlinmod/suite
url = https://github.com/estebanzimanyi/berlinmod-portability.git
19 changes: 19 additions & 0 deletions INTEGRATION_NOTES.md
@@ -0,0 +1,19 @@
# INTEGRATION BRANCH NOTE — MEOS / JMEOS pin

`integration/berlinmod-bench` builds against ecosystem pin
**`ecosystem-pin-2026-06-11p`**.

- **`libs/JMEOS-1.4.jar`** is the canonical JMEOS regen at that pin
(JMEOS PR #19): a single generated `functions.GeneratedFunctions`
surface. The legacy hand-rolled `functions.functions` facade is retired,
and every UDF binds the generated surface directly.
- **`lib/libmeos.so`** is built from the pin with `-DH3=ON` and the
CBUFFER / NPOINT / POSE / RGEO families, so the th3index family is backed
with no build-time special-casing.
- **CI** (`.github/workflows/maven.yml`) builds `libmeos` from the pin on
Linux and macOS (with H3). The Windows job is non-blocking until the
`MEOS_TZDATA_DIR` cmake option lands in the pin (it currently lives only on
the `meos-windows-bootstrap` branch); once folded, Windows repoints to the
pin like the other platforms.

The full unit suite is green (907/907).
10 changes: 9 additions & 1 deletion README.md
@@ -16,6 +16,14 @@ The MobilityDB project is developed by the Computer & Decision Engineering Depar

More information about MobilityDB, including publications, presentations, etc., can be found in the MobilityDB [website](https://mobilitydb.com).

### For contributors and reviewers

- Reviewing a pull request? See the
[PR Reviewer Guide](doc/contributing/reviewer-guide.md) — tier ranking,
dependency chains and the standards checklist. Reviewers landing in
any of the three platform repos (MobilityDB / MobilityDuck /
MobilitySpark) find the same canonical structure at the same path.


## Table of Contents

@@ -40,7 +48,7 @@ More information about MobilityDB, including publications, presentations, etc.,

- 🚀 MobilityDB installed with MEOS
- 🔧 JMEOS working version
-- ⚡ Spark 3.4.0
+- ⚡ Apache Spark 3.5.x (LTS); see [`doc/spark-version.md`](doc/spark-version.md) for the Spark-version target and the rationale for not yet supporting Spark 4
- 📝 Maven 4
- ☕ Java 17 (recommended)

102 changes: 102 additions & 0 deletions berlinmod/README.md
@@ -0,0 +1,102 @@
# BerlinMOD benchmark — MobilitySpark

This directory benchmarks the **canonical portable BerlinMOD suite** on
Apache Spark via MobilitySpark's MEOS-backed UDFs.

The queries, schema, and load script are **not** kept here — they live in the
single canonical source, vendored as the `suite/` git submodule
([`berlinmod-portability`](https://github.com/estebanzimanyi/berlinmod-portability),
shared byte-for-byte with MobilityDB and MobilityDuck). This directory holds
only the Spark-side runner, the data corpus, and the expected results.

```
berlinmod/
  suite/          # submodule — the ONE canonical SQL (q01..q17, qrt, schema.sql,
                  # portable_aliases.sql, data/load.sql)
  data/           # CSV corpus loaded into Spark
  expected/       # expected query results for correctness checks
  bench/          # bench_mspark.sh (timing), report.py, chart.py
  run_mspark.sh   # correctness demo runner
```

After cloning, initialise the submodule:

```bash
git submodule update --init berlinmod/suite
```

---

## Running

**Timing benchmark** (`BerlinMODBench` — reads every query from `suite/`):

```bash
berlinmod/bench/bench_mspark.sh --data berlinmod/data --output bench/results/mspark.json
# --quick = 1 run/query, --queries q05,q10 to select, --runs N to repeat
```

The runner sets `-Dberlinmod.sql.dir=berlinmod/suite`, so every query is read
from the canonical submodule — there is no Spark-specific query variant and no
SQL rewriting.

**Benchmark runner** (`BerlinMODBench` — canonical berlinmod/suite queries + shared CSV data):

```bash
berlinmod/run_mspark.sh [spark-submit-binary]
```

---

## Spatial joins on Spark

Spark SQL has **no native spatial index**. The canonical queries pre-filter
spatial joins with the bounding-box `&&` operator, which PostgreSQL (GiST) and
DuckDB (TRTREE) accelerate with a spatial index. Spark cannot — it evaluates
`&&` as an opaque UDF, so the index-less spatial joins **q10 / q11 / q12 / q14**
degrade to a Cartesian product with a per-pair MEOS UDF (each call re-parses the
trajectory from hex). On the full corpus these exceed any reasonable per-query
budget; bound them with the caller's own `timeout`.

The route to timings comparable to the indexed engines is the **th3index
columnar prefilter**: explode each trip's th3index into one row per H3 cell and
turn the overlap into an **equi-join on the cell column**, which Spark
accelerates natively (hash join) — pruning candidate pairs before the exact
`tDwithin`/`eDwithin` runs. th3index is portable (PG/DuckDB/Spark all compute it
from the same MEOS function), so it is the Tier-1 acceleration all three share.
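
The pruning idea can be sketched in plain Python. This is an illustration of the explode-then-equi-join pattern only, not the Spark implementation: the trip ids, the cell ids, and the `exact_match` callback (a stand-in for the exact `tDwithin`/`eDwithin` check) are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(trip_cells):
    """Return the trip-id pairs that share at least one cell."""
    # "Explode": one (cell, trip) entry per cell, grouped by cell the way
    # the build side of a hash join groups rows by the join key.
    by_cell = defaultdict(set)
    for trip_id, cells in trip_cells.items():
        for cell in cells:
            by_cell[cell].add(trip_id)

    pairs = set()
    for trips in by_cell.values():
        # Only trips landing in the same cell become candidates.
        pairs.update(combinations(sorted(trips), 2))
    return pairs


def prefiltered_join(trip_cells, exact_match):
    """Run the expensive exact predicate only on the pruned candidates."""
    return sorted(p for p in candidate_pairs(trip_cells) if exact_match(*p))


# Hypothetical data: trip id -> set of cell ids the trip touches.
trips = {
    "t1": {"c1", "c2"},
    "t2": {"c2", "c3"},
    "t3": {"c9"},
}
print(sorted(candidate_pairs(trips)))  # [('t1', 't2')]
```

With three trips the Cartesian product has three unordered pairs; the shared-cell join leaves one candidate for the exact predicate. Whether this pays off in practice depends on the H3 resolution and on how many cells each trip covers.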

---

## Data

`data/` holds the BerlinMOD corpus (`trips.csv` is git-ignored — it is large and
regenerated). The CSVs are produced by
[MobilityDB-BerlinMOD](https://github.com/MobilityDB/MobilityDB-BerlinMOD) via
`berlinmod_portability_export()`:

```sql
-- In a PostgreSQL database with generated BerlinMOD data:
\i BerlinMOD/berlinmod_export.sql
-- args: path, H3 resolution, output SRID (3812 = ETRS89/Belgian Lambert 2008,
-- Brussels — true metres; the export reprojects + SRID-tags everything here so
-- no consumer ever reprojects)
SELECT berlinmod_portability_export('/path/to/output/', 7, 3812);
```

This single call writes the complete dataset (`vehicles.csv`, `trips.csv` with
hex-EWKB trips, `query_licences/instants/points/periods/regions.csv`) in the
schema defined by `suite/schema.sql`. Drop them into `data/` and re-run — there
is no per-tool post-processing.

---

## Cross-engine comparison

Each tool benchmarks itself and emits a results JSON. The cross-engine
comparison is assembled **offline** by merging the per-tool JSONs:

```bash
# Collect mspark.json (here) + mbdb.json / mduck.json (from those repos) into one dir, then:
python3 bench/report.py --results <dir> --output <dir>/report.md
python3 bench/chart.py --results <dir> --output <dir>/chart.png --log # --log: Spark spatial joins span orders of magnitude
```
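
For orientation, the offline merge step could look roughly like the sketch below. It is not `report.py`: the results JSON schema is not shown here, so the sketch assumes (hypothetically) that each per-tool file is a flat object mapping a query name to its median time in milliseconds. The function names are illustrative.

```python
import json
from pathlib import Path


def load_results(results_dir):
    """Read every *.json in results_dir, keyed by file stem (e.g. 'mspark')."""
    # ASSUMPTION: each file is a flat {"q01": <median ms>, ...} object.
    return {
        path.stem: json.loads(path.read_text())
        for path in sorted(Path(results_dir).glob("*.json"))
    }


def merge_results(per_tool):
    """Pivot {tool: {query: ms}} into {query: {tool: ms}}."""
    merged = {}
    for tool, timings in per_tool.items():
        for query, ms in timings.items():
            merged.setdefault(query, {})[tool] = ms
    return merged


def to_markdown(merged):
    """Render one row per query and one column per tool."""
    tools = sorted({tool for row in merged.values() for tool in row})
    lines = [
        "| query | " + " | ".join(tools) + " |",
        "|---" * (len(tools) + 1) + "|",
    ]
    for query in sorted(merged):
        # A tool that did not run (or timed out on) a query shows as n/a.
        cells = [str(merged[query].get(tool, "n/a")) for tool in tools]
        lines.append(f"| {query} | " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Keeping missing entries as `n/a` rather than dropping the row lets a query that only some engines completed still appear in the table.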
6 changes: 6 additions & 0 deletions berlinmod/bench/.gitignore
@@ -0,0 +1,6 @@
# Machine-specific benchmark results — do not commit
results/*.json
results/report.md

# DuckDB scratch database
/tmp/berlinmod_bench.duckdb
56 changes: 56 additions & 0 deletions berlinmod/bench/README.md
@@ -0,0 +1,56 @@
# MobilitySpark BerlinMOD benchmark runner

Times the canonical BerlinMOD suite on Apache Spark. The queries come from the
`berlinmod/suite/` submodule (the single canonical source); see
[`../README.md`](../README.md) for context.

---

## Prerequisites

```bash
# Apache Spark 3.5.4 (~300 MB) + Maven
sudo apt-get install maven
bash ../../setup/install_spark.sh
source ~/.bashrc

# Build the fat JAR (once, or after code changes)
cd ../.. && mvn package -DskipTests -q
```

## Run

```bash
# Full corpus, 3 runs/query, data from berlinmod/data/
bash bench_mspark.sh --data ../data --runs 3 --output results/mspark.json

# 1 run/query; select queries
bash bench_mspark.sh --data ../data --quick --queries q05,q10
```

All queries run in a single `spark-submit` session (class `BerlinMODBench`);
JVM startup is excluded from query timings. The runner reads every query from
`berlinmod/suite/` — no Spark-specific variant. The index-less spatial joins
(q10/q11/q12/q14) can exceed a per-query budget on the full corpus; wrap the
call in `timeout` to bound them (see `../README.md`, "Spatial joins on Spark").
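
When the runner is driven from a script rather than a shell, the same bound can be applied with Python's standard `subprocess` timeout. This is a minimal sketch; `run_bounded` and the 600-second budget in the comment are illustrative choices, not part of the runner.

```python
import subprocess
import sys


def run_bounded(cmd, budget_s):
    """Run cmd; return its exit code, or None if it exceeded budget_s seconds."""
    try:
        return subprocess.run(cmd, timeout=budget_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the direct child before re-raising.
        return None


# Illustrative call, using only flags documented above:
# run_bounded(["bash", "bench_mspark.sh", "--data", "../data",
#              "--quick", "--queries", "q10"], budget_s=600)

# Self-contained demonstration with a trivially fast child process.
print(run_bounded([sys.executable, "-c", "pass"], budget_s=30))  # 0
```

Only the direct child is killed on timeout. If the script spawns further processes (a JVM, for instance), they may need separate cleanup.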

## Cross-engine comparison (offline)

Each tool emits its own `*.json`. To build a comparison table/chart, collect
`mspark.json` here together with `mbdb.json` / `mduck.json` from those repos
into one directory, then merge offline:

```bash
python3 report.py --results <dir> --output <dir>/report.md
python3 chart.py --results <dir> --output <dir>/chart.png --log
```

`chart.py` requires `matplotlib`. `--log` is advisable since Spark's spatial
joins span several orders of magnitude.

## Methodology

- **Timing**: wall-clock milliseconds around each query, median of N runs; data
loading excluded.
- **SQL**: the identical canonical `<query>.sql` from `berlinmod/suite/` —
named-function portable dialect, no operator symbols.
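
The timing rule above (wall-clock milliseconds, median of N runs) can be sketched as follows. The real measurement lives in the `BerlinMODBench` class; this Python version only illustrates the arithmetic, and `time_query_ms` and `run_query` are hypothetical names.

```python
import statistics
import time


def time_query_ms(run_query, runs=3):
    """Median wall-clock milliseconds over `runs` executions of run_query()."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# The median discards a single slow outlier such as a cold first run:
print(statistics.median([900.0, 7.0, 5.0]))  # 7.0
```

A mean over the same three samples would be about 304 ms, dominated by the outlier, which is why the median suits a small N.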