
GH-3902: new in-memory graph GraphMemIndexedSet #3903

Open: arne-bdt wants to merge 1 commit into apache:main from arne-bdt:GH-3902-GraphMemIndexedSet

arne-bdt (Contributor) commented May 6, 2026

GitHub issue resolved #3902

Pull request Description:

GH-3902: Add new in-memory graph GraphMemIndexedSet

Summary

This branch adds a new in-memory Graph implementation, GraphMemIndexedSet, alongside the existing GraphMemFast and GraphMemRoaring. It is architecturally a sibling of GraphMemRoaring (single flat triple set + three S/P/O indexes), but the RoaringBitmaps are replaced by plain int[]-backed index lists plus per-slot reverse-index arrays. The goal is a graph that keeps Roaring's strengths on bulk iteration and partial pattern matches, but uses less memory and runs faster on pattern lookups dominated by single-key streams.

The factory exposes it as:

Graph g = GraphMemFactory.createGraphMemIndexedSet();   // EAGER
Graph g = new GraphMemIndexedSet(IndexingStrategy.LAZY_PARALLEL);

The same IndexingStrategy enum that GraphMemRoaring uses (EAGER / LAZY / LAZY_PARALLEL / MANUAL / MINIMAL) is supported, so the bulk-load → initializeIndexParallel() pattern carries over.

As part of this change the four shared strategy classes (StoreStrategy, LazyStoreStrategy, ManualStoreStrategy, MinimalStoreStrategy) were lifted from mem.store.roaring.strategies into a new common package mem.store.strategies, and a tiny IndexedTripleSource abstraction was introduced so the new iterators/spliterators don't couple to a concrete triple set type.

Key changes

  • New org.apache.jena.mem.GraphMemIndexedSet with the same public surface as GraphMemRoaring (copy(), getIndexingStrategy(), initializeIndex(), initializeIndexParallel(), resetIndexingStrategy(), isIndexInitialized()).
  • New package org.apache.jena.mem.store.indexed containing:
    • IndexedSetTripleStore – TripleStore implementation that delegates to a configurable StoreStrategy.
    • TripleSet – FastHashSet<Triple> with a grow-hook so parallel index arrays stay in lock-step.
    • IndexList – append-only int[] with ×1.5 growth, swap-with-last removal, and a static intersects() helper.
    • NodesToIndices – FastHashMap<Node, IndexList> plus a getOrNew(...) insertion fast-path.
    • EagerStoreStrategy – three NodesToIndices maps + three parallel int[] reverse-index arrays.
    • IndexListIterator / IndexListSpliterator – single-list iteration.
    • IndexListsIterator / IndexListsSpliterator – two-list intersection by reverse-index probe.
    • IndexedTripleSource – minimal size() / getTriples() view used by the iterators.
  • Refactor: shared strategy classes moved from mem.store.roaring.strategies to mem.store.strategies; RoaringTripleStore updated to import from the new location. Roaring also picks up a small IndexedTripleSource and uses the same iteration plumbing where it makes sense.
  • Factory: GraphMemFactory.createGraphMemIndexedSet() added; javadoc notes that GraphMemIndexedSet is intended to replace GraphMemRoaring in the future.
  • Tests: ~2 000 lines of new unit tests (GraphMemIndexedSetTest, IndexingStrategyTest, IndexListTest, IndexListIteratorTest, IndexListSpliteratorTest, IndexListsIteratorTest, IndexListsSpliteratorTest, IndexedSetTripleStoreTest, NodesToIndicesTest, TripleSetTest). The old TestGraphContainsAnything / TestGraphContainsTriple benchmarks were consolidated into a single, parameterized TestGraphContains covering all eight contains patterns.
  • JMH harness: TestGraphAdd, TestGraphCopy, TestGraphDelete, TestGraphFindAll…, TestGraphStream…, TestGraphFindByMatchAnd… and the Context / GraphTripleNodeHelperCurrent helpers were updated so all three implementations can be driven by the same parameter set.
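
The IndexList bullet above describes an append-only int[] with ×1.5 growth and swap-with-last removal. A minimal pure-JDK sketch of that idea follows; the class name `IndexListSketch`, its method names, the initial capacity of 4, and the `-1` return convention are assumptions made for illustration and are not taken from the PR's code.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the PR's IndexList; not the actual Jena class. */
final class IndexListSketch {
    private int[] data = new int[4];   // initial capacity is an arbitrary choice
    private int size;

    /** Appends a triple index and returns the position it was stored at. */
    int add(int tripleIndex) {
        if (size == data.length) {
            // Grow by roughly x1.5; the +1 guarantees progress for tiny arrays.
            data = Arrays.copyOf(data, size + (size >> 1) + 1);
        }
        data[size] = tripleIndex;
        return size++;
    }

    /**
     * Swap-with-last removal: the last element fills the hole at pos.
     * Returns the triple index that was moved, or -1 if pos was the last
     * position (triple indices are assumed to be non-negative).
     */
    int removeAt(int pos) {
        if (pos < 0 || pos >= size) {
            throw new IndexOutOfBoundsException("pos=" + pos);
        }
        int lastPos = --size;
        if (pos == lastPos) {
            return -1;
        }
        int moved = data[lastPos];
        data[pos] = moved;
        return moved;
    }

    int size() { return size; }

    int capacity() { return data.length; }

    /** Copy of the live elements, in their current (unordered) positions. */
    int[] toArray() { return Arrays.copyOf(data, size); }
}
```

Returning the moved element from `removeAt` is one way to let a caller patch a reverse index afterwards. Removal does not preserve order, which is acceptable for a list that is only used as a set of triple indices.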

Architectural decisions

  1. Same shape as GraphMemRoaring, different index payload. One canonical TripleSet owns every triple at a stable integer slot; three S/P/O maps point each node to the set of triple indices that mention it. Pattern dispatch and the StoreStrategy interface are unchanged, so all five IndexingStrategy variants (EAGER / LAZY / LAZY_PARALLEL / MANUAL / MINIMAL) carry across without duplication.
  2. int[] index lists instead of bitmaps. Each per-node entry is an append-only IndexList (raw int[], ×1.5 growth, swap-with-last removal). Bitmaps amortise well only when the per-node populations are large and dense; for the typical mix of "few predicates with many triples, many subjects/objects with a handful of triples" you find in real RDF, raw int arrays use less memory and stream faster.
  3. Parallel reverse-index arrays for O(1) removal. Each of sReverseIndices / pReverseIndices / oReverseIndices is an int[] sized to the canonical TripleSet's keys.length. For every triple slot it stores the position of that triple inside its node's IndexList. Removal is then O(1) (swap with last, patch the reverse index for the moved element). To keep the arrays in lock-step with the triple set, TripleSet exposes an onKeysGrowHook that EagerStoreStrategy uses to grow the three reverse arrays whenever the triple set re-hashes.
  4. Allocation-free intersection. Two-key patterns (SP_, S_O, _PO) are answered by walking the shorter IndexList and probing each triple index against the larger list's reverse-index (indicesLarger[reverseIndicesLarger[i]] == i). This is O(min(|A|, |B|)) with no temporary collection — in contrast to bitmap AND which materialises an intermediate bitmap. The same pattern is used for contains* (IndexList.intersects(...)) and for the streaming/iterator variants (IndexListsIterator, IndexListsSpliterator).
  5. IndexedTripleSource abstraction. A two-method interface (size(), Triple[] getTriples()) decouples the iterators/spliterators from any concrete triple-set type. TripleSet implements it directly.
  6. Strategy package consolidation. StoreStrategy, LazyStoreStrategy, ManualStoreStrategy, and MinimalStoreStrategy were moved from mem.store.roaring.strategies into a shared mem.store.strategies package. Both RoaringTripleStore and IndexedSetTripleStore now construct strategies from the same code, which is what makes "use any of the five IndexingStrategy variants" free for the new store.
  7. copy() reuses the index. When the source has an eager index built, the copy clones the three NodesToIndices maps and the three reverse-index arrays directly instead of rebuilding from the triples. This is what produces the ~2× speedup on copy against GraphMemFast in the benchmarks.
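
Decisions 3 and 4 rest on the same invariant: for every triple slot, the reverse-index array holds the slot's position inside its node's index list. The sketch below shows, under that assumption, how the invariant gives O(1) removal and an allocation-free two-list intersection. It is pure JDK and all names (`ReverseIndexSketch`, `IntList`, `countCommon`) are illustrative, not the PR's classes.

```java
import java.util.Arrays;

/** Pure-JDK sketch; all names are illustrative, not the PR's actual classes. */
final class ReverseIndexSketch {

    /** Minimal growable int list standing in for one node's index list. */
    static final class IntList {
        int[] data = new int[4];
        int size;

        /** Appends a triple slot and returns its position in this list. */
        int add(int slot) {
            if (size == data.length) {
                data = Arrays.copyOf(data, size + (size >> 1) + 1);
            }
            data[size] = slot;
            return size++;
        }
    }

    /** Adds the slot to the list and records its position in the reverse array. */
    static void insert(IntList list, int[] reverse, int slot) {
        reverse[slot] = list.add(slot);
    }

    /** O(1) removal; precondition: the slot is currently in the list. */
    static void remove(IntList list, int[] reverse, int slot) {
        int pos = reverse[slot];
        int last = list.data[--list.size];
        list.data[pos] = last;   // last element fills the hole
        reverse[last] = pos;     // patch the reverse index of the moved element
    }

    /** Membership probe: the reverse index must point back at the slot. */
    static boolean contains(IntList list, int[] reverse, int slot) {
        int pos = reverse[slot];
        return pos < list.size && list.data[pos] == slot;
    }

    /** Walks the shorter list and probes the longer one; no temporary collection. */
    static int countCommon(IntList a, int[] reverseA, IntList b, int[] reverseB) {
        if (a.size > b.size) {
            return countCommon(b, reverseB, a, reverseA);
        }
        int common = 0;
        for (int k = 0; k < a.size; k++) {
            if (contains(b, reverseB, a.data[k])) {
                common++;
            }
        }
        return common;
    }
}
```

One reverse array per dimension is enough because each triple slot belongs to exactly one subject list, one predicate list and one object list. The bounds check in `contains` matters: the reverse entry may refer to a position in a different node's list or may be stale after a removal.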

User-facing trade-offs

| | GraphMemFast | GraphMemIndexedSet (new) | GraphMemRoaring |
|---|---|---|---|
| Best at | small/medium graphs, mixed workloads, point lookups | bulk iteration, single-key streams, partial-pattern queries, large graphs | very large graphs with dense per-node populations |
| Memory | lowest on most small graphs | lowest on large graphs, otherwise within about 10 % of Fast | higher than IndexedSet on every measured dataset (about 12–56 %) |
| Bulk iteration / find(ANY,ANY,ANY) | slow (3 nested hash maps) | ~2.5× faster than Fast | ≈ IndexedSet |
| Point lookup contains(S,P,O) | fastest | ~1.7× slower than Fast | ~1.7× slower than Fast |
| Two-key pattern (SP_,S_O,_PO) | competitive | fastest on most datasets | competitive |
| Single-key stream (S__,_P_,__O) | comparable | comparable to Fast | ~6× slower than Fast in geomean, up to ~18× in the worst case |
| copy() | baseline | ~2× faster | ≈ baseline |
| Lazy / manual / minimal indexing | not supported | supported | supported |
| Bulk-load → build index in parallel | not supported | supported (initializeIndexParallel()) | supported |

For users

  • Drop-in replacement for GraphMemRoaring: same constructor / factory pattern, same IndexingStrategy knobs, same semantics. Switching is GraphMemFactory.createGraphMemRoaring() → GraphMemFactory.createGraphMemIndexedSet().
  • Likely to beat Fast if your workload is dominated by find/stream over patterns with at least one wildcard, or by find(ANY,ANY,ANY) iteration.
  • Likely to lose to Fast on workloads dominated by contains(S,P,O) (fully concrete) point lookups — Fast's three S/P/O hash-set chains have a tighter early-exit path there. If your hot path is concrete contains, stay on GraphMemFast.
  • Use the bulk-load idiom when loading a large graph from disk: new GraphMemIndexedSet(IndexingStrategy.LAZY_PARALLEL) (or MANUAL + explicit initializeIndexParallel() after the load) gives you a fast load and a fully indexed graph afterwards.
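
The bulk-load idiom in the last bullet separates loading from indexing. The sketch below illustrates that separation in plain JDK code, not with the Jena API: triples are first collected without any index, then the three S/P/O indexes are built concurrently. How the PR's initializeIndexParallel() is actually implemented is not shown in this description; building one index per task with CompletableFuture is an assumption, and the integer node IDs and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Illustrative only: integer IDs stand in for Nodes, array rows for triple slots. */
final class ParallelIndexBuildSketch {

    /** Builds one index: node ID in the given column -> list of triple slots. */
    static Map<Integer, List<Integer>> buildIndex(int[][] triples, int column) {
        Map<Integer, List<Integer>> index = new HashMap<>();
        for (int slot = 0; slot < triples.length; slot++) {
            index.computeIfAbsent(triples[slot][column], k -> new ArrayList<>()).add(slot);
        }
        return index;
    }

    /**
     * Builds the S, P and O indexes concurrently after the load has finished.
     * Each task writes only to its own map and only reads the triples array,
     * so no synchronisation is needed as long as nothing mutates the array.
     */
    static List<Map<Integer, List<Integer>>> buildAllInParallel(int[][] triples) {
        CompletableFuture<Map<Integer, List<Integer>>> s =
                CompletableFuture.supplyAsync(() -> buildIndex(triples, 0));
        CompletableFuture<Map<Integer, List<Integer>>> p =
                CompletableFuture.supplyAsync(() -> buildIndex(triples, 1));
        CompletableFuture<Map<Integer, List<Integer>>> o =
                CompletableFuture.supplyAsync(() -> buildIndex(triples, 2));
        return List.of(s.join(), p.join(), o.join());
    }
}
```

The point of the idiom is that the load phase pays no per-triple indexing cost, and the indexing phase can use several cores because the three indexes are independent of each other.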

For developers / maintainers

  • Pros
    • One canonical triple set is the single source of truth; the indexes are pure auxiliary state, which removes a whole class of consistency bugs compared with GraphMemFast's three parallel hash-of-hash structures.
    • No new runtime dependency: unlike GraphMemRoaring, the implementation is pure JDK (int[], Arrays, CompletableFuture). Easier to reason about, easier to profile, no external compatibility risk from the RoaringBitmap library.
    • Strategy package consolidation cuts ~300 lines of duplicated lazy/manual/minimal code and gives both stores the same configurability surface.
    • IndexedTripleSource carves out the seam that a future transactional / COW graph can slot into without re-writing iterators.
  • Cons
    • More moving parts than GraphMemFast: TripleSet now has a grow-hook that must be wired up, the three reverse-index arrays must be sized in lock-step with the triple set's keys array, and IndexList.removeAt plus the strategy's removeIndex* must agree on the swap-with-last invariant. The new test suite locks these in.
    • Long-term direction is to deprecate GraphMemRoaring in favour of GraphMemIndexedSet (called out in the factory javadoc) once it has had bake time in the field — until then we are carrying both.
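
The grow-hook wiring mentioned under Cons can be sketched as a callback that the store invokes whenever its backing array is reallocated. The code below is a toy model under stated assumptions: `SlotStore` is an append-only array rather than a hash set, growth doubles the capacity, and the hook receives the new capacity as an int. None of these names or signatures are taken from the PR.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;

/** Toy model of a grow-hook; names and signatures are assumptions. */
final class GrowHookSketch {

    /** Hands out stable slots and notifies a hook when its backing array grows. */
    static final class SlotStore {
        private Object[] keys;
        private int size;
        private final IntConsumer onKeysGrow;   // receives the new capacity

        SlotStore(int initialCapacity, IntConsumer onKeysGrow) {
            this.keys = new Object[initialCapacity];
            this.onKeysGrow = onKeysGrow;
        }

        /** Stores the key and returns its slot. */
        int add(Object key) {
            if (size == keys.length) {
                keys = Arrays.copyOf(keys, Math.max(1, keys.length * 2));
                onKeysGrow.accept(keys.length);   // dependents resize before the slot is used
            }
            keys[size] = key;
            return size++;
        }

        int capacity() { return keys.length; }
    }

    /** Dependent state that must always be as long as the store's keys array. */
    static final class ReverseArrays {
        int[] s;
        int[] p;
        int[] o;

        ReverseArrays(int capacity) {
            s = new int[capacity];
            p = new int[capacity];
            o = new int[capacity];
        }

        /** Grows all three arrays together, preserving existing entries. */
        void growTo(int newCapacity) {
            s = Arrays.copyOf(s, newCapacity);
            p = Arrays.copyOf(p, newCapacity);
            o = Arrays.copyOf(o, newCapacity);
        }
    }
}
```

Wiring is a single method reference, for example `new GrowHookSketch.SlotStore(4, reverse::growTo)`. Calling the hook inside the store's own growth path is what keeps the arrays in lock-step without callers having to remember to resize them.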

Performance (JMH, JDK 25, single fork, 4 to 5 warmup + 10 to 15 measurement iterations, single-shot)

Geometric mean of IndexedSet / Fast and Roaring / Fast runtime across all seven datasets (bsbm-1m, bsbm-5m, bsbm-25m, RealGrid_EQ/SSH/SV, TestGrid_6750000). Lower is better; values below 1.00× mean faster than GraphMemFast.

| Category | IndexedSet / Fast | Roaring / Fast | IndexedSet / Roaring |
|---|---|---|---|
| graphFindAll (full iteration via iterator) | 0.17× | 0.17× | 1.04× |
| graphStream (full iteration via stream) | 0.55× | 0.62× | 0.90× |
| graphStreamParallel | 0.74× | 0.69× | 1.08× |
| copy | 0.58× | 0.85× | 0.60× |
| graphDelete | 0.80× | 1.24× | 0.65× |
| graphAdd | 1.01× | 1.07× | 0.94× |
| contains(S,P,O) (point) | 1.72× | 1.72× | 1.00× |
| contains single-key (S__,_P_,__O) | 1.13× | 1.20× | 0.94× |
| contains two-key (SP_,S_O,_PO) | 1.16× | 1.77× | 0.66× |
| findAndCount single-key | 1.04× | 1.94× | 0.54× |
| findAndCount two-key | 0.85× | 1.82× | 0.47× |
| graphStream single-key (S__,_P_,__O) | 1.00× | 5.77× | 0.17× |
| graphStream two-key (SP_,S_O,_PO) | 0.40× | 0.67× | 0.60× |
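
For readers who want to reproduce this kind of summary from their own JMH runs, here is a small sketch of a geometric mean over per-dataset runtime ratios. The helper names are hypothetical, and any numbers passed in are up to the caller; nothing here is taken from the result files attached to this PR.

```java
/** Hypothetical helper for summarising runtime ratios; not part of the PR. */
final class GeoMeanSketch {

    /** Element-wise candidate / baseline runtime ratios, one per dataset. */
    static double[] ratios(double[] candidate, double[] baseline) {
        if (candidate.length != baseline.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double[] result = new double[candidate.length];
        for (int i = 0; i < candidate.length; i++) {
            result[i] = candidate[i] / baseline[i];
        }
        return result;
    }

    /** Geometric mean, computed via logs to avoid overflow of a long product. */
    static double geometricMean(double[] ratios) {
        if (ratios.length == 0) {
            throw new IllegalArgumentException("no ratios");
        }
        double logSum = 0.0;
        for (double r : ratios) {
            if (r <= 0.0) {
                throw new IllegalArgumentException("ratios must be positive");
            }
            logSum += Math.log(r);
        }
        return Math.exp(logSum / ratios.length);
    }
}
```

The geometric mean is the usual choice for ratios because a 2× slowdown on one dataset and a 2× speedup on another cancel out to 1.00×, which an arithmetic mean would report as 1.25×.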

JMH results as JSON files

You may drop these on https://jmh.morethan.io/ to get nice visualizations.

TestGraphStreamAll_20260525105812.json
TestGraphStreamAll_20260525034618.json
TestGraphFindByMatchAndCount_20260525103408.json
TestGraphFindByMatchAndCount_20260525040918.json
TestGraphFindAllWithForEachRemaining_20260525103115.json
TestGraphFindAllWithForEachRemaining_20260525014247.json
TestGraphDelete_20260525102101.json
TestGraphDelete_20260524225742.json
TestGraphCopy_20260525101855.json
TestGraphCopy_20260525040110.json
TestGraphContains_20260525094841.json
TestGraphContains_20260525001056.json
TestGraphAdd_20260525124525.json
TestGraphAdd_20260525094414.json
TestGraphAdd_20260525015003.json
TestGraphStreamByMatchAndCount_20260525110407.json
TestGraphStreamByMatchAndCount_20260525020722.json

Headline takeaways:

  • Full iteration is the biggest win: findAll is ~6× faster than GraphMemFast and on par with Roaring; stream and parallel stream are also clearly ahead of Fast. Both Roaring and IndexedSet pay off here because all triples live in a single flat array and iteration walks the array directly.
  • Two-key patterns (SP_, S_O, _PO), the case Roaring was originally introduced for: IndexedSet matches or beats Roaring on every workload, and beats GraphMemFast by 15–60 % in geomean on findAndCount and graphStream (two-key contains is ~16 % slower than Fast). The plain int-array intersection is faster than RoaringBitmap.and(...) for the populations typical in RDF.
  • Single-key streams (S__, _P_, __O) are the largest gap vs. Roaring: IndexedSet is on par with Fast, while Roaring is 6–18× slower in the worst cases (e.g. graphStream__O on bsbm-25m: Fast 0.58 s, IndexedSet 0.59 s, Roaring 10.6 s). Walking a dense bitmap of millions of indices is dramatically more expensive than walking an int[].
  • copy() is ~2× faster than Fast because the index can be cloned directly (no rebuild).
  • add and delete are essentially Fast's speed — within a couple of percent on most datasets, with delete actually winning by ~20 % on the bsbm graphs because swap-with-last is cheaper than removing from a FastHashSet chain.
  • The cost shows up on the fully-concrete contains(S,P,O): IndexedSet (and Roaring) need a hash lookup in the central TripleSet, whereas Fast can short-circuit through one of its three slot-specific hash chains. Both are ~1.7× slower there.

Memory consumption

Heap size of the populated graph (loaded once, measured after a forced GC):

| Dataset | GraphMemFast | GraphMemIndexedSet | GraphMemRoaring |
|---|---|---|---|
| bsbm-1m.nt.gz | 64 MB | 70 MB | 80 MB |
| bsbm-5m.nt.gz | 382 MB | 364 MB | 409 MB |
| bsbm-25m.nt.gz | 1912 MB | 1692 MB | 1905 MB |
| cheeses-0.1.ttl | 0.610 MB | 0.575 MB | 0.804 MB |
| pizza.owl.rdf | 0.163 MB | 0.168 MB | 0.241 MB |
| RealGrid_EQ.rdf | 74 MB | 80 MB | 103 MB |
| RealGrid_SSH.rdf | 14.5 MB | 13.8 MB | 18.6 MB |
| RealGrid_SV.rdf | 17 MB | 18 MB | 28 MB |
| TestGrid_6750000.rdf | 560 MB | 564 MB | 726 MB |

  • GraphMemIndexedSet is consistently smaller than GraphMemRoaring (about 11–36 % smaller across the datasets above), because the per-node int[] index lists are denser than the equivalent Roaring containers for the relatively small per-node populations typical in RDF.
  • Vs. GraphMemFast: on the largest workloads (bsbm-5m, bsbm-25m) IndexedSet is actually the smallest of the three, and it is also slightly smaller than Fast on cheeses and RealGrid_SSH. On bsbm-1m, pizza, RealGrid_EQ and RealGrid_SV Fast is still 3–9 % leaner; the three reverse-index int[]s sized to the triple-set's capacity show up as fixed overhead until the triples themselves dominate.

Test plan

  • New unit tests for IndexList, IndexListIterator, IndexListSpliterator, IndexListsIterator, IndexListsSpliterator, NodesToIndices, TripleSet, IndexedSetTripleStore, IndexingStrategy (all five variants), and the full GraphMemIndexedSetTest.
  • Existing GraphMemFast / GraphMemRoaring / GraphMemLegacy test suites continue to pass after the strategy package move.
  • JMH suite (TestGraphAdd, TestGraphCopy, TestGraphDelete, TestGraphFindAllWith*, TestGraphStream*, TestGraphFindByMatch*, TestGraphContains) extended with the new implementation; results above.

Disclaimer for AI usage

The production code is almost completely written by hand, using only GitHub Copilot auto-completion.
I let Claude Code generate a lot of the unit tests for the new classes.
I also let Claude fix and update a lot of the JavaDoc.
I did my best to review, understand and, if needed, fix every single generated line.
The PR description above is mainly generated by Claude Code.

Future work

The layout of GraphMemIndexedSet makes it an ideal candidate for building transactional graphs, which could be much faster than DatasetGraphMem. I have already built a CoW prototype, but I guess MVCC would be the better pattern. Ideas and feedback are welcome.


  • Tests are included.
  • Commits have been squashed to remove intermediate development commit messages.
  • Key commit messages start with the issue number (GH-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.


See the Apache Jena "Contributing" guide.

@arne-bdt arne-bdt changed the title Gh 3902 graph mem indexed set Gh 3902 new in-memory graph GraphMemIndexedSet May 6, 2026
@arne-bdt arne-bdt force-pushed the GH-3902-GraphMemIndexedSet branch 3 times, most recently from 50001fe to 648db02 Compare May 7, 2026 12:51
@afs afs changed the title Gh 3902 new in-memory graph GraphMemIndexedSet GH-3902: new in-memory graph GraphMemIndexedSet May 12, 2026
@arne-bdt arne-bdt force-pushed the GH-3902-GraphMemIndexedSet branch 3 times, most recently from e1b4524 to 62287eb Compare May 25, 2026 16:08
GraphMemIndexedSet is based on the architecture of GraphMemRoaring, but replaces the RoaringBitmaps by simple index-lists and a reverse index.
@arne-bdt arne-bdt force-pushed the GH-3902-GraphMemIndexedSet branch from 62287eb to f0d1f5d Compare May 25, 2026 16:34
@arne-bdt arne-bdt marked this pull request as ready for review May 25, 2026 16:36

afs (Member) commented May 27, 2026

This looks exciting!

> The layout of GraphMemIndexedSet makes it an ideal candidate to build transactional graphs which could be much faster than DatasetGraphMem

This looks very exciting!

afs (Member) commented May 27, 2026

Warnings:

The import org.apache.jena.mem.store.TripleStore is never used	GraphMemIndexedSet.java
	/jena-core/src/main/java/org/apache/jena/mem	line 24
Javadoc: Duplicate tag for return type	JenaSetIndexed.java
	/jena-core/src/main/java/org/apache/jena/mem/collection	line 43
The import org.apache.jena.mem.pattern.MatchPattern is never used	EagerStoreStrategy.java
	/jena-core/src/main/java/org/apache/jena/mem/store/indexed	line 26
The import org.apache.jena.mem.pattern.PatternClassifier is never used	EagerStoreStrategy.java
	/jena-core/src/main/java/org/apache/jena/mem/store/indexed	line 27
The import org.apache.jena.mem.pattern.MatchPattern is never used	EagerStoreStrategy.java
	/jena-core/src/main/java/org/apache/jena/mem/store/roaring	line 26
The import org.apache.jena.mem.pattern.PatternClassifier is never used	EagerStoreStrategy.java
	/jena-core/src/main/java/org/apache/jena/mem/store/roaring	line 27
The import org.roaringbitmap.ImmutableBitmapDataProvider is never used	EagerStoreStrategy.java
	/jena-core/src/main/java/org/apache/jena/mem/store/roaring	line 32
The method getTriples() of type TripleSet should be tagged with @Override since it actually overrides a superinterface method	TripleSet.java
	/jena-core/src/main/java/org/apache/jena/mem/store/roaring	line 58
The import org.apache.jena.mem.pattern.MatchPattern is never used	LazyStoreStrategy.java
	/jena-core/src/main/java/org/apache/jena/mem/store/strategies	line 26
The import org.apache.jena.mem.pattern.MatchPattern is never used	ManualStoreStrategy.java
	/jena-core/src/main/java/org/apache/jena/mem/store/strategies	line 26
The import org.apache.jena.mem.pattern.MatchPattern is never used	MinimalStoreStrategy.java
	/jena-core/src/main/java/org/apache/jena/mem/store/strategies	line 27
The import org.apache.jena.mem.collection.FastHashSet is never used	RoaringBitmapTripleIteratorTest.java
	/jena-core/src/test/java/org/apache/jena/mem/store/roaring	line 24

afs (Member) commented May 27, 2026

I tried changing GraphMemFactory and ran mvn -fae clean install after tweaking jena-integration-tests to ignore indirect dependency on jena-ontapi.

No change from using GraphMemRoaring from main, so that's good.

jena-ontapi has some CME's as before.



Development

Successfully merging this pull request may close these issues.

New GraphMemIndexedSet

3 participants