GH-3902: new in-memory graph GraphMemIndexedSet by arne-bdt · Pull Request #3903 · apache/jena

arne-bdt · 2026-05-06T14:29:15Z

GitHub issue resolved #3902

Pull request Description:

GH-3902: Add new in-memory graph `GraphMemIndexedSet`

Summary

This branch adds a new in-memory Graph implementation, GraphMemIndexedSet, alongside the existing GraphMemFast and GraphMemRoaring. It is architecturally a sibling of GraphMemRoaring (single flat triple set + three S/P/O indexes), but the RoaringBitmaps are replaced by plain int[]-backed index lists plus per-slot reverse-index arrays. The goal is a graph that keeps Roaring's strengths on bulk iteration and partial pattern matches, but uses less memory and runs faster on pattern lookups dominated by single-key streams.

The factory exposes it as:

Graph g = GraphMemFactory.createGraphMemIndexedSet();   // EAGER
Graph g = new GraphMemIndexedSet(IndexingStrategy.LAZY_PARALLEL);

The same IndexingStrategy enum that GraphMemRoaring uses (EAGER / LAZY / LAZY_PARALLEL / MANUAL / MINIMAL) is supported, so the bulk-load → initializeIndexParallel() pattern carries over.

As part of this change the four shared strategy classes (StoreStrategy, LazyStoreStrategy, ManualStoreStrategy, MinimalStoreStrategy) were lifted from mem.store.roaring.strategies into a new common package mem.store.strategies, and a tiny IndexedTripleSource abstraction was introduced so the new iterators/spliterators don't couple to a concrete triple set type.

Key changes

New org.apache.jena.mem.GraphMemIndexedSet with the same public surface as GraphMemRoaring (copy(), getIndexingStrategy(), initializeIndex(), initializeIndexParallel(), resetIndexingStrategy(), isIndexInitialized()).
New package org.apache.jena.mem.store.indexed containing:
- IndexedSetTripleStore – TripleStore implementation that delegates to a configurable StoreStrategy.
- TripleSet – FastHashSet<Triple> with a grow-hook so parallel index arrays stay in lock-step.
- IndexList – append-only int[] with ×1.5 growth, swap-with-last removal, and a static intersects() helper.
- NodesToIndices – FastHashMap<Node, IndexList> plus a getOrNew(...) insertion fast-path.
- EagerStoreStrategy – three NodesToIndices maps + three parallel int[] reverse-index arrays.
- IndexListIterator / IndexListSpliterator – single-list iteration.
- IndexListsIterator / IndexListsSpliterator – two-list intersection by reverse-index probe.
- IndexedTripleSource – minimal size() / getTriples() view used by the iterators.
Refactor: shared strategy classes moved from mem.store.roaring.strategies to mem.store.strategies; RoaringTripleStore updated to import from the new location. Roaring also picks up a small IndexedTripleSource and uses the same iteration plumbing where it makes sense.
Factory: GraphMemFactory.createGraphMemIndexedSet() added; javadoc notes that GraphMemIndexedSet is intended to replace GraphMemRoaring in the future.
Tests: ~2 000 lines of new unit tests (GraphMemIndexedSetTest, IndexingStrategyTest, IndexListTest, IndexListIteratorTest, IndexListSpliteratorTest, IndexListsIteratorTest, IndexListsSpliteratorTest, IndexedSetTripleStoreTest, NodesToIndicesTest, TripleSetTest). The old TestGraphContainsAnything / TestGraphContainsTriple benchmarks were consolidated into a single, parameterized TestGraphContains covering all eight contains patterns.
JMH harness: TestGraphAdd, TestGraphCopy, TestGraphDelete, TestGraphFindAll…, TestGraphStream…, TestGraphFindByMatchAnd… and the Context / GraphTripleNodeHelperCurrent helpers were updated so all three implementations can be driven by the same parameter set.
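
The role of the IndexedTripleSource view and the single-list iterators can be pictured with a small stand-alone sketch. It is not the PR's code: `SlotSource`, `SlotListIterator` and `SlotSourceDemo` are hypothetical names, and a generic type stands in for `Triple` so the example needs no Jena classes. The only assumption is that a source exposes its backing array and that an index list is an `int[]` of positions into that array.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a two-method source view (size + backing array). */
interface SlotSource<T> {
    int size();        // number of entries in the source
    T[] getEntries();  // backing array, addressed by stable slot number
}

/** Walks the first `count` slot numbers of an int[] and yields the entries stored there. */
final class SlotListIterator<T> implements Iterator<T> {
    private final T[] entries;
    private final int[] slots;
    private final int count;
    private int pos = 0;

    SlotListIterator(SlotSource<T> source, int[] slots, int count) {
        this.entries = source.getEntries();
        this.slots = slots;
        this.count = count;
    }

    @Override public boolean hasNext() { return pos < count; }

    @Override public T next() {
        if (pos >= count) throw new NoSuchElementException();
        return entries[slots[pos++]];
    }
}

final class SlotSourceDemo {
    /** Joins the entries found at the given slots into one space-separated string. */
    static String join(String[] entries, int[] slots, int count) {
        SlotSource<String> source = new SlotSource<>() {
            @Override public int size() { return entries.length; }
            @Override public String[] getEntries() { return entries; }
        };
        StringBuilder sb = new StringBuilder();
        Iterator<String> it = new SlotListIterator<>(source, slots, count);
        while (it.hasNext()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(it.next());
        }
        return sb.toString();
    }
}
```

Because the iterator only depends on the interface, any container that can hand out its array could be iterated the same way.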

Architectural decisions

Same shape as GraphMemRoaring, different index payload. One canonical TripleSet owns every triple at a stable integer slot; three S/P/O maps point each node to the set of triple indices that mention it. Pattern dispatch and the StoreStrategy interface are unchanged, so all five IndexingStrategy variants (EAGER / LAZY / LAZY_PARALLEL / MANUAL / MINIMAL) carry across without duplication.
int[] index lists instead of bitmaps. Each per-node entry is an append-only IndexList (raw int[], ×1.5 growth, swap-with-last removal). Bitmaps amortise well only when the per-node populations are large and dense; for the typical mix of "few predicates with many triples, many subjects/objects with a handful of triples" you find in real RDF, raw int arrays use less memory and stream faster.
Parallel reverse-index arrays for O(1) removal. Each of sReverseIndices / pReverseIndices / oReverseIndices is an int[] sized to the canonical TripleSet's keys.length. For every triple slot it stores the position of that triple inside its node's IndexList. Removal is then O(1) (swap with last, patch the reverse index for the moved element). To keep the arrays in lock-step with the triple set, TripleSet exposes an onKeysGrowHook that EagerStoreStrategy uses to grow the three reverse arrays whenever the triple set re-hashes.
Allocation-free intersection. Two-key patterns (SP_, S_O, _PO) are answered by walking the shorter IndexList and probing each triple index against the larger list's reverse-index (indicesLarger[reverseIndicesLarger[i]] == i). This is O(min(|A|, |B|)) with no temporary collection — in contrast to bitmap AND which materialises an intermediate bitmap. The same pattern is used for contains* (IndexList.intersects(...)) and for the streaming/iterator variants (IndexListsIterator, IndexListsSpliterator).
IndexedTripleSource abstraction. A two-method interface (size(), Triple[] getTriples()) decouples the iterators/spliterators from any concrete triple-set type. TripleSet implements it directly.
Strategy package consolidation. StoreStrategy, LazyStoreStrategy, ManualStoreStrategy, and MinimalStoreStrategy were moved from mem.store.roaring.strategies into a shared mem.store.strategies package. Both RoaringTripleStore and IndexedSetTripleStore now construct strategies from the same code, which is what makes "use any of the five IndexingStrategy variants" free for the new store.
copy() reuses the index. When the source has an eager index built, the copy clones the three NodesToIndices maps and the three reverse-index arrays directly instead of rebuilding from the triples. This is what produces the ~2× speedup on copy against GraphMemFast in the benchmarks.
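
The reverse-index removal and the probe-based intersection described above can be sketched in a few lines of plain Java. This is an illustration under assumptions, not the PR's implementation: `SlotList`, `ReverseIndexed` and `IntersectDemo` are hypothetical names, keys are plain strings, and a `HashMap` replaces the FastHashMap-based NodesToIndices. The bounds check in `contains` is added here so that stale reverse entries can never match.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Append-only list of triple slots with x1.5 growth and swap-with-last removal. */
final class SlotList {
    int[] slots = new int[4];
    int size = 0;

    /** Appends a slot and returns its position inside this list. */
    int add(int slot) {
        if (size == slots.length) {
            slots = Arrays.copyOf(slots, slots.length + (slots.length >> 1)); // grow by x1.5
        }
        slots[size] = slot;
        return size++;
    }

    /** Overwrites pos with the last element; returns the slot that moved, or -1 if none did. */
    int removeAt(int pos) {
        int last = --size;
        if (pos == last) return -1;
        slots[pos] = slots[last];
        return slots[pos];
    }
}

/** One index (e.g. "by subject"): key -> SlotList, plus reverse[slot] = position in that list. */
final class ReverseIndexed {
    final Map<String, SlotList> lists = new HashMap<>();
    final int[] reverse;

    ReverseIndexed(int capacity) { reverse = new int[capacity]; }

    void add(String key, int slot) {
        reverse[slot] = lists.computeIfAbsent(key, k -> new SlotList()).add(slot);
    }

    /** O(1) removal: no scan of the list, only a swap and one reverse-index patch. */
    void remove(String key, int slot) {
        SlotList list = lists.get(key);
        int moved = list.removeAt(reverse[slot]);
        if (moved >= 0) reverse[moved] = reverse[slot]; // patch the element that was swapped in
        if (list.size == 0) lists.remove(key);
    }

    /** True if slot is listed under key: a single array probe. */
    boolean contains(String key, int slot) {
        SlotList list = lists.get(key);
        if (list == null) return false;
        int pos = reverse[slot];
        return pos < list.size && list.slots[pos] == slot;
    }
}

final class IntersectDemo {
    /** Counts slots listed under both keys by walking the shorter list and probing the other index. */
    static int countBoth(ReverseIndexed a, String keyA, ReverseIndexed b, String keyB) {
        SlotList la = a.lists.get(keyA), lb = b.lists.get(keyB);
        if (la == null || lb == null) return 0;
        if (la.size > lb.size) return countBoth(b, keyB, a, keyA); // always walk the shorter list
        int n = 0;
        for (int i = 0; i < la.size; i++) {
            if (b.contains(keyB, la.slots[i])) n++;
        }
        return n;
    }
}
```

No temporary collection is allocated during the intersection; the cost is one probe per element of the shorter list.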

User-facing trade-offs

| | `GraphMemFast` | `GraphMemIndexedSet` (new) | `GraphMemRoaring` |
| --- | --- | --- | --- |
| Best at | small/medium graphs, mixed workloads, point lookups | bulk iteration, single-key streams, partial-pattern queries, large graphs | very large graphs with dense per-node populations |
| Memory | lowest on small graphs | lowest on large graphs, otherwise within a few % of Fast | consistently 15–30 % higher |
| Bulk iteration / `find(ANY,ANY,ANY)` | slow (3 nested hash maps) | ~2.5× faster than Fast | ≈ IndexedSet |
| Point lookup `contains(S,P,O)` | fastest | ~1.7× slower than Fast | ~1.7× slower than Fast |
| Two-key pattern (`SP_`,`S_O`,`_PO`) | competitive | fastest on most datasets | competitive |
| Single-key stream (`S__`,`_P_`,`__O`) | comparable | comparable to Fast | 10–12× slower than Fast |
| `copy()` | baseline | ~2× faster | ≈ baseline |
| Lazy / manual / minimal indexing | not supported | supported | supported |
| Bulk-load → build index in parallel | not supported | supported (`initializeIndexParallel()`) | supported |

For users

Drop-in replacement for GraphMemRoaring: same constructor / factory pattern, same IndexingStrategy knobs, same semantics. Switching is GraphMemFactory.createGraphMemRoaring() → GraphMemFactory.createGraphMemIndexedSet().
Likely to beat Fast if your workload is dominated by find/stream over patterns with at least one wildcard, or by find(ANY,ANY,ANY) iteration.
Likely to lose to Fast on workloads dominated by contains(S,P,O) (fully concrete) point lookups — Fast's three S/P/O hash-set chains have a tighter early-exit path there. If your hot path is concrete contains, stay on GraphMemFast.
Use the bulk-load idiom when loading a large graph from disk: new GraphMemIndexedSet(IndexingStrategy.LAZY_PARALLEL) (or MANUAL + explicit initializeIndexParallel() after the load) gives you a fast load and a fully indexed graph afterwards.
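
The idea behind the bulk-load idiom (load first without index maintenance, then build the S/P/O indexes concurrently) can be shown with a toy store. This is a sketch under assumptions, not the PR's code: `DeferredIndexStore` is a hypothetical class, triples are plain string arrays, and the method names merely mirror `initializeIndexParallel()` and `isIndexInitialized()` from the PR's public surface. The use of `CompletableFuture` follows the PR's remark that the implementation relies on it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Toy store: triples go into a flat list first; the three indexes are built later, on request. */
final class DeferredIndexStore {
    private final List<String[]> triples = new ArrayList<>();   // {subject, predicate, object}
    private Map<String, List<Integer>> bySubject, byPredicate, byObject;

    void add(String s, String p, String o) {
        if (isIndexInitialized()) throw new IllegalStateException("sketch only supports load-then-index");
        triples.add(new String[] {s, p, o});                     // cheap: no index maintenance during load
    }

    boolean isIndexInitialized() { return bySubject != null; }

    /** Builds the S, P and O indexes concurrently; each task reads the list and fills its own map. */
    void initializeIndexParallel() {
        CompletableFuture<Map<String, List<Integer>>> s = CompletableFuture.supplyAsync(() -> build(0));
        CompletableFuture<Map<String, List<Integer>>> p = CompletableFuture.supplyAsync(() -> build(1));
        CompletableFuture<Map<String, List<Integer>>> o = CompletableFuture.supplyAsync(() -> build(2));
        bySubject = s.join();
        byPredicate = p.join();
        byObject = o.join();
    }

    private Map<String, List<Integer>> build(int position) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int slot = 0; slot < triples.size(); slot++) {
            index.computeIfAbsent(triples.get(slot)[position], k -> new ArrayList<>()).add(slot);
        }
        return index;
    }

    /** Number of triples with the given predicate; requires the index. */
    int countByPredicate(String p) {
        if (!isIndexInitialized()) throw new IllegalStateException("index not built yet");
        return byPredicate.getOrDefault(p, List.of()).size();
    }
}
```

The three build tasks never write to shared state, which is what makes running them in parallel safe in this sketch.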

For developers / maintainers

Pros
- One canonical triple set is the single source of truth; the indexes are pure auxiliary state, which removes a whole class of consistency bugs compared with GraphMemFast's three parallel hash-of-hash structures.
- No new runtime dependency: unlike GraphMemRoaring, the implementation is pure JDK (int[], Arrays, CompletableFuture). Easier to reason about, easier to profile, no external compatibility risk from the RoaringBitmap library.
- Strategy package consolidation cuts ~300 lines of duplicated lazy/manual/minimal code and gives both stores the same configurability surface.
- IndexedTripleSource carves out the seam that a future transactional / COW graph can slot into without re-writing iterators.
Cons
- More moving parts than GraphMemFast — TripleSet now has a grow-hook that must be wired up, the three reverse-index arrays must be sized in lock-step with the triple set's keys array, and IndexList.removeAt plus the strategy's removeIndex* must agree on the swap-with-last invariant. The new test suite locks these in.
- Long-term direction is to deprecate GraphMemRoaring in favour of GraphMemIndexedSet (called out in the factory javadoc) once it has had bake time in the field — until then we are carrying both.
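
The lock-step requirement between the triple set's keys array and the reverse-index arrays can be illustrated with a minimal grow-hook sketch. It is an assumption-laden simplification: `GrowingKeys` and `LockStepIndex` are hypothetical classes, the hook is modelled as an `IntConsumer` receiving the new capacity, and the container here simply doubles a plain array rather than re-hashing like the PR's TripleSet.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;

/** Toy container whose backing array can grow; it reports the new capacity to a hook. */
final class GrowingKeys {
    private Object[] keys = new Object[4];
    private int used = 0;
    private final IntConsumer onKeysGrow;

    GrowingKeys(IntConsumer onKeysGrow) { this.onKeysGrow = onKeysGrow; }

    int capacity() { return keys.length; }

    /** Stores the key and returns its slot; grows first (and fires the hook) when full. */
    int add(Object key) {
        if (used == keys.length) {
            keys = Arrays.copyOf(keys, keys.length * 2);
            onKeysGrow.accept(keys.length);   // dependents resize before the new slot is used
        }
        keys[used] = key;
        return used++;
    }
}

/** Keeps one parallel int[] exactly as long as the keys array by registering a grow hook. */
final class LockStepIndex {
    final GrowingKeys keys;
    int[] reverse;

    LockStepIndex() {
        keys = new GrowingKeys(newCapacity -> reverse = Arrays.copyOf(reverse, newCapacity));
        reverse = new int[keys.capacity()];
    }

    int add(Object key, int positionInList) {
        int slot = keys.add(key);         // may fire the hook and enlarge `reverse`
        reverse[slot] = positionInList;   // in bounds: `reverse` is as long as the keys array
        return slot;
    }
}
```

If the hook were not wired up, the write to `reverse[slot]` would go out of bounds as soon as the container grew, which is the failure mode the lock-step invariant guards against.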

Performance (JMH, JDK 25, single fork, 4 to 5 warmup + 10 to 15 measurement iterations, single-shot)

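Geometric mean of the IndexedSet / Fast, Roaring / Fast and IndexedSet / Roaring runtime ratios across all seven datasets (bsbm-1m, bsbm-5m, bsbm-25m, RealGrid_EQ/SSH/SV, TestGrid_6750000). Lower is better. A value below 1.0× means the first-named implementation is faster than the second, e.g. <1.0× in the first two columns means faster than GraphMemFast.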

| Category | IndexedSet / Fast | Roaring / Fast | IndexedSet / Roaring |
| --- | --- | --- | --- |
| `graphFindAll` (full iteration via iterator) | 0.17× | 0.17× | 1.04× |
| `graphStream` (full iteration via stream) | 0.55× | 0.62× | 0.90× |
| `graphStreamParallel` | 0.74× | 0.69× | 1.08× |
| `copy` | 0.58× | 0.85× | 0.60× |
| `graphDelete` | 0.80× | 1.24× | 0.65× |
| `graphAdd` | 1.01× | 1.07× | 0.94× |
| `contains(S,P,O)` (point) | 1.72× | 1.72× | 1.00× |
| `contains` single-key (`S__`,`_P_`,`__O`) | 1.13× | 1.20× | 0.94× |
| `contains` two-key (`SP_`,`S_O`,`_PO`) | 1.16× | 1.77× | 0.66× |
| `findAndCount` single-key | 1.04× | 1.94× | 0.54× |
| `findAndCount` two-key | 0.85× | 1.82× | 0.47× |
| `graphStream` single-key (`S__`,`_P_`,`__O`) | 1.00× | 5.77× | 0.17× |
| `graphStream` two-key (`SP_`,`S_O`,`_PO`) | 0.40× | 0.67× | 0.60× |
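
For readers who want to reproduce this kind of summary from the per-dataset JMH numbers, the aggregation is a geometric mean of runtime ratios. The helper below is a minimal sketch, not the script used for the PR; `GeoMean` is a hypothetical class, and any per-dataset ratios fed into it would have to come from the JSON files. It also shows how a ratio such as 0.17× translates into a speedup factor of roughly 5.9 (1 / 0.17).

```java
final class GeoMean {
    /** Geometric mean of positive ratios, computed in log space for numerical stability. */
    static double of(double[] ratios) {
        double logSum = 0.0;
        for (double r : ratios) {
            logSum += Math.log(r);
        }
        return Math.exp(logSum / ratios.length);
    }

    /** Converts a runtime ratio (candidate / baseline) into a "times faster" factor. */
    static double speedup(double ratio) {
        return 1.0 / ratio;
    }
}
```

The geometric mean is the usual choice for ratios because a 2× slowdown on one dataset and a 2× speedup on another cancel out to 1.0×, which an arithmetic mean would not give.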

JMH results as JSON files

You may drop these on https://jmh.morethan.io/ to get nice visualizations.

TestGraphStreamAll_20260525105812.json
TestGraphStreamAll_20260525034618.json
TestGraphFindByMatchAndCount_20260525103408.json
TestGraphFindByMatchAndCount_20260525040918.json
TestGraphFindAllWithForEachRemaining_20260525103115.json
TestGraphFindAllWithForEachRemaining_20260525014247.json
TestGraphDelete_20260525102101.json
TestGraphDelete_20260524225742.json
TestGraphCopy_20260525101855.json
TestGraphCopy_20260525040110.json
TestGraphContains_20260525094841.json
TestGraphContains_20260525001056.json
TestGraphAdd_20260525124525.json
TestGraphAdd_20260525094414.json
TestGraphAdd_20260525015003.json
TestGraphStreamByMatchAndCount_20260525110407.json
TestGraphStreamByMatchAndCount_20260525020722.json

Headline takeaways:

Full iteration is the biggest win: findAll is ~6× faster than GraphMemFast and on par with Roaring; stream and parallel stream are also clearly ahead of Fast. Both Roaring and IndexedSet pay off here because all triples live in a single flat array and iteration walks the array directly.
Two-key patterns (SP_, S_O, _PO) are the case Roaring was originally introduced for: IndexedSet matches or beats Roaring on every workload, and beats GraphMemFast by 15–60 % in geomean on findAndCount and graphStream (0.85× and 0.40×). Only two-key contains is slower than Fast (1.16×). The plain int-array intersection is faster than RoaringBitmap.and(...) for the populations typical in RDF.
Single-key streams (S__, _P_, __O) are the largest gap vs. Roaring: IndexedSet is on par with Fast, while Roaring is 6–18× slower in the worst cases (e.g. graphStream__O on bsbm-25m: Fast 0.58 s, IndexedSet 0.59 s, Roaring 10.6 s). Walking a dense bitmap of millions of indices is dramatically more expensive than walking an int[].
copy() is ~2× faster than Fast because the index can be cloned directly (no rebuild).
add and delete are essentially Fast's speed — within a couple of percent on most datasets, with delete actually winning by ~20 % on the bsbm graphs because swap-with-last is cheaper than removing from a FastHashSet chain.
The cost shows up on the fully-concrete contains(S,P,O): IndexedSet (and Roaring) need a hash lookup in the central TripleSet, whereas Fast can short-circuit through one of its three slot-specific hash chains. Both are ~1.7× slower there.

Memory consumption

Heap size of the populated graph (loaded once, measured after a forced GC):

| Dataset | `GraphMemFast` | `GraphMemIndexedSet` | `GraphMemRoaring` |
| --- | --- | --- | --- |
| `bsbm-1m.nt.gz` | 64 MB | 70 MB | 80 MB |
| `bsbm-5m.nt.gz` | 382 MB | 364 MB | 409 MB |
| `bsbm-25m.nt.gz` | 1912 MB | 1692 MB | 1905 MB |
| `cheeses-0.1.ttl` | 0.610 MB | 0.575 MB | 0.804 MB |
| `pizza.owl.rdf` | 0.163 MB | 0.168 MB | 0.241 MB |
| `RealGrid_EQ.rdf` | 74 MB | 80 MB | 103 MB |
| `RealGrid_SSH.rdf` | 14.5 MB | 13.8 MB | 18.6 MB |
| `RealGrid_SV.rdf` | 17 MB | 18 MB | 28 MB |
| `TestGrid_6750000.rdf` | 560 MB | 564 MB | 726 MB |

GraphMemIndexedSet is consistently smaller than GraphMemRoaring (roughly 11–36 % smaller across the datasets above), because the per-node int[] index lists are denser than the equivalent Roaring containers for the relatively small per-node populations typical in RDF.
Vs. GraphMemFast: on the largest workloads (bsbm-5m, bsbm-25m) IndexedSet is actually the smallest of the three, and it is also slightly smaller on cheeses and RealGrid_SSH. On bsbm-1m, pizza, RealGrid_EQ and RealGrid_SV Fast is still 3–9 % leaner, and on TestGrid_6750000 the two are within 1 %. The three reverse-index int[]s sized to the triple-set's capacity show up as fixed overhead until the triples themselves dominate.
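
The relative differences can be recomputed directly from the heap sizes in the table. The snippet below is a small helper sketch, with `HeapComparison` as a hypothetical class name; the numbers are copied from the table above and nothing else is assumed.

```java
final class HeapComparison {
    static final String[] DATASETS = {
        "bsbm-1m", "bsbm-5m", "bsbm-25m", "cheeses-0.1", "pizza.owl",
        "RealGrid_EQ", "RealGrid_SSH", "RealGrid_SV", "TestGrid_6750000"
    };

    // Heap sizes in MB, copied from the memory table: {Fast, IndexedSet, Roaring}
    static final double[][] MB = {
        {64, 70, 80}, {382, 364, 409}, {1912, 1692, 1905}, {0.610, 0.575, 0.804},
        {0.163, 0.168, 0.241}, {74, 80, 103}, {14.5, 13.8, 18.6}, {17, 18, 28}, {560, 564, 726}
    };

    /** Percentage by which size a is smaller than size b; negative when a is larger. */
    static double percentSmaller(double a, double b) {
        return (1.0 - a / b) * 100.0;
    }

    public static void main(String[] args) {
        // One line per dataset: IndexedSet relative to Fast and relative to Roaring.
        for (int i = 0; i < DATASETS.length; i++) {
            System.out.printf("%-18s vs Fast: %6.1f %%   vs Roaring: %6.1f %%%n",
                DATASETS[i],
                percentSmaller(MB[i][1], MB[i][0]),
                percentSmaller(MB[i][1], MB[i][2]));
        }
    }
}
```

A negative value in the "vs Fast" column marks the datasets where GraphMemFast is the leaner of the two.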

Test plan

New unit tests for IndexList, IndexListIterator, IndexListSpliterator, IndexListsIterator, IndexListsSpliterator, NodesToIndices, TripleSet, IndexedSetTripleStore, IndexingStrategy (all five variants), and a full GraphMemIndexedSetTest.
Existing GraphMemFast / GraphMemRoaring / GraphMemLegacy test suites continue to pass after the strategy package move.
JMH suite (TestGraphAdd, TestGraphCopy, TestGraphDelete, TestGraphFindAllWith*, TestGraphStream*, TestGraphFindByMatch*, TestGraphContains) extended with the new implementation; results above.

Disclaimer for AI usage

The production code is almost completely written by hand, using only GitHub Copilot auto-completion.
I let Claude Code generate a lot of the unit tests for the new classes.
I also let Claude fix and update a lot of the JavaDoc.
-> I did my best to review, understand and, if needed, fix every single generated line.
The PR description above is mainly generated by Claude Code.

Future work

The layout of GraphMemIndexedSet makes it an ideal candidate for building transactional graphs which could be much faster than DatasetGraphMem. I have already built a CoW prototype, but I guess MVCC would be the better pattern. Ideas and feedback are welcome.

Tests are included.
Commits have been squashed to remove intermediate development commit messages.
Key commit messages start with the issue number (GH-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.

See the Apache Jena "Contributing" guide.

GraphMemIndexedSet is based on the architecture of GraphMemRoaring, but replaces the RoaringBitmaps with simple index lists and a reverse index.

afs · 2026-05-27T15:50:23Z

This looks exciting!

> The layout of GraphMemIndexedSet makes it an ideal candidate to build transactional graphs which could be much faster than DatasetGraphMem

This looks very exciting!

afs · 2026-05-27T16:35:24Z

Warnings:

The import org.apache.jena.mem.store.TripleStore is never used	GraphMemIndexedSet.java
	/jena-core/src/main/java/org/apache/jena/mem	line 24
Javadoc: Duplicate tag for return type	JenaSetIndexed.java
	/jena-core/src/main/java/org/apache/jena/mem/collection	line 43
The import org.apache.jena.mem.pattern.MatchPattern is never used	EagerStoreStrategy.java
	/jena-core/src/main/java/org/apache/jena/mem/store/indexed	line 26
The import org.apache.jena.mem.pattern.PatternClassifier is never used	EagerStoreStrategy.java
	/jena-core/src/main/java/org/apache/jena/mem/store/indexed	line 27
The import org.apache.jena.mem.pattern.MatchPattern is never used	EagerStoreStrategy.java
	/jena-core/src/main/java/org/apache/jena/mem/store/roaring	line 26
The import org.apache.jena.mem.pattern.PatternClassifier is never used	EagerStoreStrategy.java
	/jena-core/src/main/java/org/apache/jena/mem/store/roaring	line 27
The import org.roaringbitmap.ImmutableBitmapDataProvider is never used	EagerStoreStrategy.java
	/jena-core/src/main/java/org/apache/jena/mem/store/roaring	line 32
The method getTriples() of type TripleSet should be tagged with @Override since it actually overrides a superinterface method	TripleSet.java
	/jena-core/src/main/java/org/apache/jena/mem/store/roaring	line 58
The import org.apache.jena.mem.pattern.MatchPattern is never used	LazyStoreStrategy.java
	/jena-core/src/main/java/org/apache/jena/mem/store/strategies	line 26
The import org.apache.jena.mem.pattern.MatchPattern is never used	ManualStoreStrategy.java
	/jena-core/src/main/java/org/apache/jena/mem/store/strategies	line 26
The import org.apache.jena.mem.pattern.MatchPattern is never used	MinimalStoreStrategy.java
	/jena-core/src/main/java/org/apache/jena/mem/store/strategies	line 27
The import org.apache.jena.mem.collection.FastHashSet is never used	RoaringBitmapTripleIteratorTest.java
	/jena-core/src/test/java/org/apache/jena/mem/store/roaring	line 24

afs · 2026-05-27T17:49:18Z

I tried changing GraphMemFactory and ran mvn -fae clean install after tweaking jena-integration-tests to ignore indirect dependency on jena-ontapi.

No change from using GraphMemRoaring from main, so that's good.

jena-ontapi has some CME's as before.

arne-bdt changed the title ~~Gh 3902 graph mem indexed set~~ Gh 3902 new in-memory graph GraphMemIndexedSet May 6, 2026

arne-bdt force-pushed the GH-3902-GraphMemIndexedSet branch 3 times, most recently from 50001fe to 648db02, May 7, 2026 12:51

afs changed the title ~~Gh 3902 new in-memory graph GraphMemIndexedSet~~ GH-3902: new in-memory graph GraphMemIndexedSet May 12, 2026

arne-bdt force-pushed the GH-3902-GraphMemIndexedSet branch 3 times, most recently from e1b4524 to 62287eb, May 25, 2026 16:08

Commit f0d1f5d: GH-3902: new GraphMemIndexedSet

GraphMemIndexedSet is based on the architecture of GraphMemRoaring, but replaces the RoaringBitmaps with simple index lists and a reverse index.

arne-bdt force-pushed the GH-3902-GraphMemIndexedSet branch from 62287eb to f0d1f5d, May 25, 2026 16:34

arne-bdt marked this pull request as ready for review May 25, 2026 16:36

arne-bdt wants to merge 1 commit into apache:main from arne-bdt:GH-3902-GraphMemIndexedSet
