GH-3902: new in-memory graph GraphMemIndexedSet #3903
Open
arne-bdt wants to merge 1 commit into
GraphMemIndexedSet is based on the architecture of GraphMemRoaring, but replaces the RoaringBitmaps with simple index lists and a reverse index.
This looks exciting!
This looks very exciting!
Warnings:
I tried changing No change from using
GitHub issue resolved #3902
Pull request Description:
GH-3902: Add new in-memory graph GraphMemIndexedSet

Summary
This branch adds a new in-memory Graph implementation, GraphMemIndexedSet, alongside the existing GraphMemFast and GraphMemRoaring. It is architecturally a sibling of GraphMemRoaring (single flat triple set + three S/P/O indexes), but the RoaringBitmaps are replaced by plain int[]-backed index lists plus per-slot reverse-index arrays. The goal is a graph that keeps Roaring's strengths on bulk iteration and partial pattern matches, but uses less memory and runs faster on pattern lookups dominated by single-key streams.

The factory exposes it as GraphMemFactory.createGraphMemIndexedSet().
The same IndexingStrategy enum that GraphMemRoaring uses (EAGER / LAZY / LAZY_PARALLEL / MANUAL / MINIMAL) is supported, so the bulk-load → initializeIndexParallel() pattern carries over.

As part of this change the four shared strategy classes (StoreStrategy, LazyStoreStrategy, ManualStoreStrategy, MinimalStoreStrategy) were lifted from mem.store.roaring.strategies into a new common package mem.store.strategies, and a tiny IndexedTripleSource abstraction was introduced so the new iterators/spliterators don't couple to a concrete triple set type.

Key changes
- New graph class org.apache.jena.mem.GraphMemIndexedSet with the same public surface as GraphMemRoaring (copy(), getIndexingStrategy(), initializeIndex(), initializeIndexParallel(), resetIndexingStrategy(), isIndexInitialized()).
- New package org.apache.jena.mem.store.indexed containing:
  - IndexedSetTripleStore – TripleStore implementation that delegates to a configurable StoreStrategy.
  - TripleSet – FastHashSet<Triple> with a grow-hook so parallel index arrays stay in lock-step.
  - IndexList – append-only int[] with ×1.5 growth, swap-with-last removal, and a static intersects() helper.
  - NodesToIndices – FastHashMap<Node, IndexList> plus a getOrNew(...) insertion fast-path.
  - EagerStoreStrategy – three NodesToIndices maps + three parallel int[] reverse-index arrays.
  - IndexListIterator / IndexListSpliterator – single-list iteration.
  - IndexListsIterator / IndexListsSpliterator – two-list intersection by reverse-index probe.
  - IndexedTripleSource – minimal size() / getTriples() view used by the iterators.
- Strategy classes moved from mem.store.roaring.strategies to mem.store.strategies; RoaringTripleStore updated to import from the new location. Roaring also picks up a small IndexedTripleSource and uses the same iteration plumbing where it makes sense.
- GraphMemFactory.createGraphMemIndexedSet() added; javadoc notes that GraphMemIndexedSet is intended to replace GraphMemRoaring in the future.
- New unit tests (GraphMemIndexedSetTest, IndexingStrategyTest, IndexListTest, IndexListIteratorTest, IndexListSpliteratorTest, IndexListsIteratorTest, IndexListsSpliteratorTest, IndexedSetTripleStoreTest, NodesToIndicesTest, TripleSetTest). The old TestGraphContainsAnything / TestGraphContainsTriple benchmarks were consolidated into a single, parameterized TestGraphContains covering all eight contains patterns.
- The benchmarks TestGraphAdd, TestGraphCopy, TestGraphDelete, TestGraphFindAll…, TestGraphStream…, TestGraphFindByMatchAnd… and the Context / GraphTripleNodeHelperCurrent helpers were updated so all three implementations can be driven by the same parameter set.

Architectural decisions
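As a rough illustration of the IndexList and reverse-index design named in the key changes (append-only int[], swap-with-last removal, two-list intersection by reverse-index probe), here is a minimal, self-contained sketch. It is not the PR's code: the class names IndexListSketch and ReverseIndexSketch, the String node keys, the fixed slot capacity, and the bounds check in the probe are all assumptions made for the example, and only two of the three indexes (subject and predicate) are modelled.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for an append-only index list (not the PR's actual class). */
final class IndexListSketch {
    int[] slots = new int[2];
    int size;

    /** Appends a triple slot and returns the position it now occupies in this list. */
    int add(int slot) {
        if (size == slots.length) {
            slots = Arrays.copyOf(slots, size + (size >> 1) + 1); // roughly x1.5 growth
        }
        slots[size] = slot;
        return size++;
    }

    /** Swap-with-last removal; returns the slot that moved into pos, or -1 if nothing moved. */
    int removeAt(int pos) {
        int last = --size;
        if (pos == last) return -1;
        slots[pos] = slots[last];
        return slots[pos];
    }
}

/** Subject and predicate indexes over integer triple slots, each with a reverse-index array. */
final class ReverseIndexSketch {
    final Map<String, IndexListSketch> bySubject = new HashMap<>();
    final Map<String, IndexListSketch> byPredicate = new HashMap<>();
    final int[] sReverse; // slot -> position inside its subject's list
    final int[] pReverse; // slot -> position inside its predicate's list

    /** Assumption: capacity is fixed here; the PR grows these arrays via a grow-hook instead. */
    ReverseIndexSketch(int capacity) {
        sReverse = new int[capacity];
        pReverse = new int[capacity];
    }

    void add(int slot, String s, String p) {
        sReverse[slot] = bySubject.computeIfAbsent(s, k -> new IndexListSketch()).add(slot);
        pReverse[slot] = byPredicate.computeIfAbsent(p, k -> new IndexListSketch()).add(slot);
    }

    void remove(int slot, String s, String p) {
        removeFrom(bySubject.get(s), sReverse, slot);
        removeFrom(byPredicate.get(p), pReverse, slot);
    }

    private static void removeFrom(IndexListSketch list, int[] reverse, int slot) {
        int pos = reverse[slot];
        int moved = list.removeAt(pos);
        if (moved >= 0) reverse[moved] = pos; // patch the reverse index of the moved element
    }

    /** Counts slots with subject s AND predicate p: walk the shorter list, probe the other. */
    int countSP(String s, String p) {
        IndexListSketch a = bySubject.get(s);
        IndexListSketch b = byPredicate.get(p);
        if (a == null || b == null) return 0;
        boolean walkSubject = a.size <= b.size;
        IndexListSketch shorter = walkSubject ? a : b;
        IndexListSketch larger = walkSubject ? b : a;
        int[] reverseLarger = walkSubject ? pReverse : sReverse;
        int count = 0;
        for (int i = 0; i < shorter.size; i++) {
            int slot = shorter.slots[i];
            int pos = reverseLarger[slot];
            // Bounds check is an assumption of this sketch: pos may belong to a different, longer list.
            if (pos < larger.size && larger.slots[pos] == slot) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        ReverseIndexSketch idx = new ReverseIndexSketch(8);
        idx.add(0, "A", "p1");
        idx.add(1, "A", "p2");
        idx.add(2, "B", "p1");
        idx.add(3, "A", "p1");
        idx.add(4, "B", "p2");
        System.out.println(idx.countSP("A", "p1")); // slots 0 and 3 match
        idx.remove(0, "A", "p1");
        System.out.println(idx.countSP("A", "p1")); // only slot 3 is left
    }
}
```

The probe never allocates a temporary collection: each candidate slot costs one array read in the reverse index and one in the other list, so the work is bounded by the length of the shorter list.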
- Same architecture as GraphMemRoaring, different index payload. One canonical TripleSet owns every triple at a stable integer slot; three S/P/O maps point each node to the set of triple indices that mention it. Pattern dispatch and the StoreStrategy interface are unchanged, so all five IndexingStrategy variants (EAGER / LAZY / LAZY_PARALLEL / MANUAL / MINIMAL) carry across without duplication.
- int[] index lists instead of bitmaps. Each per-node entry is an append-only IndexList (raw int[], ×1.5 growth, swap-with-last removal). Bitmaps amortise well only when the per-node populations are large and dense; for the typical mix of "few predicates with many triples, many subjects/objects with a handful of triples" you find in real RDF, raw int arrays use less memory and stream faster.
- Each of sReverseIndices / pReverseIndices / oReverseIndices is an int[] sized to the canonical TripleSet's keys.length. For every triple slot it stores the position of that triple inside its node's IndexList. Removal is then O(1) (swap with last, patch the reverse index for the moved element). To keep the arrays in lock-step with the triple set, TripleSet exposes an onKeysGrowHook that EagerStoreStrategy uses to grow the three reverse arrays whenever the triple set re-hashes.
- Two-key patterns (SP_, S_O, _PO) are answered by walking the shorter IndexList and probing each triple index against the larger list's reverse-index (indicesLarger[reverseIndicesLarger[i]] == i). This is O(min(|A|, |B|)) with no temporary collection, in contrast to bitmap AND, which materialises an intermediate bitmap. The same pattern is used for contains* (IndexList.intersects(...)) and for the streaming/iterator variants (IndexListsIterator, IndexListsSpliterator).
- IndexedTripleSource abstraction. A two-method interface (size(), Triple[] getTriples()) decouples the iterators/spliterators from any concrete triple-set type. TripleSet implements it directly.
- StoreStrategy, LazyStoreStrategy, ManualStoreStrategy, and MinimalStoreStrategy were moved from mem.store.roaring.strategies into a shared mem.store.strategies package. Both RoaringTripleStore and IndexedSetTripleStore now construct strategies from the same code, which is what makes "use any of the five IndexingStrategy variants" free for the new store.
- copy() reuses the index. When the source has an eager index built, the copy clones the three NodesToIndices maps and the three reverse-index arrays directly instead of rebuilding from the triples. This is what produces the ~2× speedup on copy against GraphMemFast in the benchmarks.

User-facing trade-offs
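[Comparison table; cell values not preserved. Columns: GraphMemFast, GraphMemIndexedSet (new), GraphMemRoaring. Rows: find(ANY,ANY,ANY); contains(S,P,O); two-key patterns (SP_, S_O, _PO); single-key patterns (S__, _P_, __O); copy(); initializeIndexParallel().]

For users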
- Coming from GraphMemRoaring: same constructor / factory pattern, same IndexingStrategy knobs, same semantics. Switching is GraphMemFactory.createGraphMemRoaring() → GraphMemFactory.createGraphMemIndexedSet().
- A good fit for workloads dominated by find/stream over patterns with at least one wildcard, or by find(ANY,ANY,ANY) iteration.
- Slower than GraphMemFast on contains(S,P,O) (fully concrete) point lookups: Fast's three S/P/O hash-set chains have a tighter early-exit path there. If your hot path is concrete contains, stay on GraphMemFast.
- For bulk loads, new GraphMemIndexedSet(IndexingStrategy.LAZY_PARALLEL) (or MANUAL + explicit initializeIndexParallel() after the load) gives you a fast load and a fully indexed graph afterwards.

For developers / maintainers
- … GraphMemFast's three parallel hash-of-hash structures.
- Unlike GraphMemRoaring, the implementation is pure JDK (int[], Arrays, CompletableFuture). Easier to reason about, easier to profile, no external compatibility risk from the RoaringBitmap library.
- IndexedTripleSource carves out the seam that a future transactional / COW graph can slot into without re-writing iterators.
- … GraphMemFast: TripleSet now has a grow-hook that must be wired up, the three reverse-index arrays must be sized in lock-step with the triple set's keys array, and IndexList.removeAt plus the strategy's removeIndex* must agree on the swap-with-last invariant. The new test suite locks these in.
- … GraphMemRoaring in favour of GraphMemIndexedSet (called out in the factory javadoc) once it has had bake time in the field; until then we are carrying both.

Performance (JMH, JDK 25, single fork, 4 to 5 warmup + 10 to 15 measurement iterations, single-shot)
Geometric mean of IndexedSet / Fast and Roaring / Fast runtime across all seven datasets (bsbm-1m, bsbm-5m, bsbm-25m, RealGrid_EQ/SSH/SV, TestGrid_6750000). Lower is better. <1.0x means faster than GraphMemFast.

[Results table; cell values not preserved. Rows: graphFindAll (full iteration via iterator); graphStream (full iteration via stream); graphStreamParallel; copy; graphDelete; graphAdd; contains(S,P,O) (point); contains single-key (S__, _P_, __O); contains two-key (SP_, S_O, _PO); findAndCount single-key; findAndCount two-key; graphStream single-key (S__, _P_, __O); graphStream two-key (SP_, S_O, _PO).]

JMH results as JSON files
You may drop these on https://jmh.morethan.io/ to get nice visualizations.
TestGraphStreamAll_20260525105812.json
TestGraphStreamAll_20260525034618.json
TestGraphFindByMatchAndCount_20260525103408.json
TestGraphFindByMatchAndCount_20260525040918.json
TestGraphFindAllWithForEachRemaining_20260525103115.json
TestGraphFindAllWithForEachRemaining_20260525014247.json
TestGraphDelete_20260525102101.json
TestGraphDelete_20260524225742.json
TestGraphCopy_20260525101855.json
TestGraphCopy_20260525040110.json
TestGraphContains_20260525094841.json
TestGraphContains_20260525001056.json
TestGraphAdd_20260525124525.json
TestGraphAdd_20260525094414.json
TestGraphAdd_20260525015003.json
TestGraphStreamByMatchAndCount_20260525110407.json
TestGraphStreamByMatchAndCount_20260525020722.json
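The ratios in this section are geometric means of per-dataset runtime ratios. As a reminder of what that aggregation does, here is a minimal sketch. The helper name GeoMeanSketch and the per-dataset ratios in the last step are made up for illustration; only the three bsbm-25m timings are taken from this description.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of the PR or of JMH. */
final class GeoMeanSketch {
    /** Geometric mean of positive ratios, computed in log space. */
    static double geometricMean(double[] ratios) {
        double logSum = 0.0;
        for (double r : ratios) {
            logSum += Math.log(r);
        }
        return Math.exp(logSum / ratios.length);
    }

    public static void main(String[] args) {
        // The one concrete measurement quoted in this description:
        // graphStream __O on bsbm-25m, in seconds.
        double fast = 0.58;
        double indexedSet = 0.59;
        double roaring = 10.6;
        System.out.printf(Locale.ROOT, "IndexedSet / Fast = %.2fx%n", indexedSet / fast);
        System.out.printf(Locale.ROOT, "Roaring / Fast    = %.2fx%n", roaring / fast);

        // Made-up per-dataset ratios, only to show the aggregation step.
        double[] madeUpRatios = {0.5, 2.0, 1.0};
        System.out.printf(Locale.ROOT, "geomean           = %.2fx%n", geometricMean(madeUpRatios));
    }
}
```

A geometric mean suits runtime ratios because they combine multiplicatively: a dataset that runs at 0.5x and one that runs at 2.0x cancel out to 1.0x, whereas an arithmetic mean would report 1.25x.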
Headline takeaways:
- findAll is ~6× faster than GraphMemFast and on par with Roaring; stream and parallel stream are also clearly ahead of Fast. Both Roaring and IndexedSet pay off here because all triples live in a single flat array and iteration walks the array directly.
- On two-key patterns (SP_, S_O, _PO), the case Roaring was originally introduced for, IndexedSet matches or beats Roaring on every workload, and beats GraphMemFast by 15–60 % in geomean. The plain int-array intersection is faster than RoaringBitmap.and(...) for the populations typical in RDF.
- Single-key patterns (S__, _P_, __O) are the largest gap vs. Roaring: IndexedSet is on par with Fast, while Roaring is 6–18× slower in the worst cases (e.g. graphStream __O on bsbm-25m: Fast 0.58 s, IndexedSet 0.59 s, Roaring 10.6 s). Walking a dense bitmap of millions of indices is dramatically more expensive than walking an int[].
- copy() is ~2× faster than Fast because the index can be cloned directly (no rebuild).
- add and delete are essentially Fast's speed: within a couple of percent on most datasets, with delete actually winning by ~20 % on the bsbm graphs because swap-with-last is cheaper than removing from a FastHashSet chain.
- contains(S,P,O): IndexedSet (and Roaring) need a hash lookup in the central TripleSet, whereas Fast can short-circuit through one of its three slot-specific hash chains. Both are ~1.7× slower there.

Memory consumption
Heap size of the populated graph (loaded once, measured after a forced GC):
[Heap-size table; cell values not preserved. Columns: GraphMemFast, GraphMemIndexedSet, GraphMemRoaring. Rows: bsbm-1m.nt.gz, bsbm-5m.nt.gz, bsbm-25m.nt.gz, cheeses-0.1.ttl, pizza.owl.rdf, RealGrid_EQ.rdf, RealGrid_SSH.rdf, RealGrid_SV.rdf, TestGrid_6750000.rdf.]

- GraphMemIndexedSet is consistently smaller than GraphMemRoaring (15–30 % smaller on every dataset), because the per-node int[] index lists are denser than the equivalent Roaring containers for the relatively small per-node populations typical in RDF.
- Relative to GraphMemFast: on the largest workloads (bsbm-5m, bsbm-25m) IndexedSet is actually the smallest of the three. On the smallest graphs (pizza, RealGrid_EQ/SSH/SV) Fast is still a few percent leaner: the three reverse-index int[]s sized to the triple-set's capacity show up as fixed overhead until the triples themselves dominate.

Test plan
- New unit tests for IndexList, IndexListIterator, IndexListSpliterator, IndexListsIterator, IndexListsSpliterator, NodesToIndices, TripleSet, IndexedSetTripleStore, IndexingStrategy (all five variants), and the full GraphMemIndexedSetTest.
- Existing GraphMemFast / GraphMemRoaring / GraphMemLegacy test suites continue to pass after the strategy package move.
- JMH benchmarks (TestGraphAdd, TestGraphCopy, TestGraphDelete, TestGraphFindAllWith*, TestGraphStream*, TestGraphFindByMatch*, TestGraphContains) extended with the new implementation; results above.

Disclaimer for AI usage
The production code is almost completely written by hand, using only GitHub Copilot auto-completion.
I let Claude Code generate a lot of the unit tests for the new classes.
I also let Claude fix and update a lot of the JavaDoc.
I did my best to review, understand and, if needed, fix every single generated line.
The PR description above is mainly generated by Claude Code.
Future work
The layout of GraphMemIndexedSet makes it an ideal candidate for building transactional graphs, which could be much faster than DatasetGraphMem. I have already built a CoW prototype, but I guess MVCC would be the better pattern. Ideas and feedback are welcome.

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.
See the Apache Jena "Contributing" guide.