fix(pt_expt): fail-fast on .pt2 GNN inference without LAMMPS atom-map#5450
fix(pt_expt): fail-fast on .pt2 GNN inference without LAMMPS atom-map#5450wanghan-iapcm wants to merge 1 commit into
Conversation
Single-rank LAMMPS .pt2 inference for a message-passing model (DPA2,
DPA3, hybrids over those) silently relied on LAMMPS atom-map to populate
``InputNlist.mapping`` — without ``atom_modify map yes`` the C++ side
fell into an identity-mapping fallback (``DeepPotPTExpt.cc:374-384``)
whose values are wrong for ghost slots ``[nloc, nall)``. The model
graph's ``_exchange_ghosts`` (``deepmd/dpmodel/descriptor/repformers.py``)
then performed ``take_along_axis(g1[1, nloc, dim], mapping_tiled)`` with
out-of-bounds gather indices for ghosts, producing a CUDA index assert
(reproduced by the user on a DPA4 model) or undefined CPU output.
Multi-rank LAMMPS without a with-comm AOTI artifact has the same class
of failure: ``pair_deepmd.cpp:243`` only populates ``lmp_list.mapping``
for ``nprocs == 1``, so the regular path always misses the ghost mapping
under multi-rank.
Both unsupported combinations now fail-fast with an actionable message
instead of silently corrupting ghost features.
Files:
* deepmd/pt_expt/utils/serialization.py — emit ``has_message_passing``
in .pt2 metadata, mirroring the descriptor's ``has_message_passing()``
API (true for DPA2 / DPA3 / hybrids over those; false for se_e2_a /
DPA1).
* source/api_cc/{include,src}/DeepPotPTExpt.{h,cc} and DeepSpinPTExpt
— read the metadata into ``has_message_passing_``, gate the fail-fast
on it so non-GNN models retain their previous behaviour. Refined
predicate:
``has_message_passing_ && !use_with_comm && !atom_map_present && nghost > 0``
Two error messages: single-rank ("add atom_modify map yes") and
multi-rank ("re-export with use_loc_mapping=False"). Defaults to
``false`` for pre-PR .pt2 archives that lack the field, so non-GNN
archives continue to work; GNN archives must be regenerated to opt
into the fail-fast guard.
* source/lmp/tests/run_mpi_pair_deepmd_dpa3_pt2.py — new
``--no-atom-map`` flag that omits ``atom_modify map yes`` from the
LAMMPS input.
* source/lmp/tests/test_lammps_dpa3_pt2.py — four-cell coverage matrix:
A : test_pair_deepmd (existing)
B : test_pair_deepmd_no_atom_map_fails_fast (new)
C : test_pair_deepmd_with_comm (new)
D : test_pair_deepmd_with_comm_no_atom_map_fails_fast (new)
C-mr: test_pair_deepmd_mpi_dpa3 (existing)
B-mr: test_pair_deepmd_mpi_no_with_comm_fails_fast (new)
D-mr: test_pair_deepmd_mpi_no_atom_map (new)
Investigation note: the ``test_deeppot_dpa_ptexpt.cc`` C++ ctest is
misleadingly named — despite the "Dpa" prefix it loads
``deeppot_dpa1.pt2`` (DPA1, non-message-passing), so its regular .pt2
graph never consumes ``mapping`` for ghost gather and identity fallback
is trivially safe. The genuinely-DPA2 ctest is
``test_deeppot_dpa2_ptexpt.cc``, which already explicitly sets
``inlist.mapping = mapping.data();`` on every ``cpu_lmp_nlist*`` case.
No C++ ctest fixtures need to be edited by this PR — the metadata-gated
fail-fast correctly skips non-message-passing models.
Local verification: 160/160 ptexpt ctests pass against freshly-regenerated
.pt2 fixtures (the new metadata field is written by all gen_*.py
scripts). The negative cells B/B-mr/D/D-mr fail-fast paths are
exercised only via the LAMMPS Python tests in CI.
Known limitations:
- Multi-rank DPA3 ``use_loc_mapping=True`` is permanently unsupported;
the fail-fast surfaces this clearly.
- Single-rank with-comm-artifact + no atom-map (cell D) could be made
to work by populating a synthetic self-mirror comm_dict; deferred.
- ``MPI_Comm_size`` is not used as the multi-rank predicate because
api_cc does not link MPI; ``lmp_list.nswap > 0`` serves as the proxy
(equivalent for all current LAMMPS configurations).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (7)
📝 WalkthroughWalkthroughThis PR extends the PT (PyTorch) model export and runtime to track message-passing descriptor behavior via a new ChangesMessage Passing Metadata Export and Dispatch Guard
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 917524709b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (has_message_passing_ && !use_with_comm && !atom_map_present && | ||
| nghost > 0) { |
There was a problem hiding this comment.
Fail fast for multi-rank regular path regardless atom-map presence
The new predicate only throws when !atom_map_present, so a multi-rank caller that does provide InputNlist.mapping can bypass this guard and still run the regular artifact (use_with_comm == false). In multi-rank runs, mapping lookups can validly resolve to ghost indices (>= nlocal), while the regular message-passing path gathers from local-only embeddings, which can still produce out-of-bounds indexing or corrupted forces. This leaves the original corruption class unblocked for a reachable configuration; the multi-rank fail-fast should not depend on whether a mapping pointer exists.
Useful? React with 👍 / 👎.
| # The C++ side gates its fail-fast on this — an absent mapping is | ||
| # fatal only for models that would silently corrupt ghost features | ||
| # otherwise. | ||
| desc = getattr(getattr(model, "atomic_model", None), "descriptor", None) |
There was a problem hiding this comment.
Non-blocking: this currently derives has_message_passing only from model.atomic_model.descriptor.has_message_passing(). That is fine for the normal DPA2/DPA3 export path, but it may under-report for future/alternate wrappers where the top-level model (or atomic_model) exposes has_message_passing() without a directly exposed descriptor. Would it be safer to first try model.has_message_passing() / model.atomic_model.has_message_passing() and only then fall back to atomic_model.descriptor.has_message_passing()?
— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)
| // model graph fills ghost features via border_op and ignores this | ||
| // tensor for ghost gather — see deepmd/pt_expt/descriptor/ | ||
| // repflows.py::_exchange_ghosts) or for trusted direct C++ callers | ||
| // (world == nullptr, see the dispatch carve-out above). Any other |
There was a problem hiding this comment.
Non-blocking: this comment still mentions world == nullptr and a dispatch carve-out, but the fail-fast predicate above does not actually special-case world == nullptr; a direct C++ caller with nghost > 0, no mapping, and no with-comm path will now throw too. That behavior may be exactly what we want, but the comment should match it to avoid future misreads.
— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #5450 +/- ##
==========================================
- Coverage 82.48% 80.75% -1.73%
==========================================
Files 830 830
Lines 88522 88527 +5
Branches 4232 4231 -1
==========================================
- Hits 73015 71489 -1526
- Misses 14220 15913 +1693
+ Partials 1287 1125 -162 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Summary
.pt2inference for message-passing models (DPA2, DPA3, hybrids over those) when the LAMMPS atom-map is not enabled. Previously the C++ side fell into an identity-mapping fallback (DeepPotPTExpt.cc:374-384) whose values are wrong for ghost slots; the model's_exchange_ghosts(deepmd/dpmodel/descriptor/repformers.py) then performedtake_along_axis(g1[1, nloc, dim], mapping_tiled)with out-of-bounds gather indices for ghosts — CUDA index assert in the user's DPA4 report, undefined CPU output otherwise.has_message_passingfield to .pt2 metadata (mirrors the descriptor'shas_message_passing()API: true for DPA2/DPA3/hybrids over those; false for se_e2_a/DPA1/etc.). Gate the fail-fast inDeepPotPTExpt::compute_innerandDeepSpinPTExpt::compute_inneron it. Non-GNN models retain their previous behaviour.atom_modify map yes…"use_loc_mapping=False…"has_message_passing_ && !use_with_comm && !atom_map_present && nghost > 0. Thenghost > 0guard skips NoPbc and isolated-cluster cases where identity over[0, nloc)is trivially correct.Four-cell coverage matrix in
test_lammps_dpa3_pt2.pyuse_loc_mappingtest_pair_deepmd(existing)test_pair_deepmd_no_atom_map_fails_fast(new)test_pair_deepmd_mpi_no_with_comm_fails_fast(new, subprocess)test_pair_deepmd_with_comm(new)border_op)test_pair_deepmd_mpi_dpa3(existing)test_pair_deepmd_with_comm_no_atom_map_fails_fast(new)test_pair_deepmd_mpi_no_atom_map(new, subprocess)Investigation note (resolves an earlier mystery)
test_deeppot_dpa_ptexpt.ccis misleadingly named — despite theDpaprefix it loadsdeeppot_dpa1.pt2(DPA1, non-message-passing). Its regular.pt2graph never consumesmappingfor ghost gather, so the identity fallback was trivially safe and the test passed without explicitinlist.mapping. The genuinely-DPA2 ctest istest_deeppot_dpa2_ptexpt.cc(different file), which already explicitly setsinlist.mapping = mapping.data();on allcpu_lmp_nlist*paths. No C++ ctest fixtures need editing in this PR — the metadata-gated fail-fast correctly skips DPA1.Backward compatibility
has_message_passing_defaults to false in C++ when the metadata field is missing — so pre-PR .pt2 archives retain their previous behaviour. Non-GNN pre-PR archives continue to work; GNN pre-PR archives must be regenerated to opt into the fail-fast guard. In-tree fixtures are generated bygen_*.pyat CI time, which always writes the new field.Test plan
*PtExpt*filter: 160 / 160 PASSED (270 s) against freshly-regenerated.pt2fixtures.pytest.raises(Exception, match=r\"atom_modify map yes\")and stdout/stderr substringuse_loc_mapping=False; if LAMMPS wraps the exception with a prefix/suffix differently than expected, the match may need adjustment.test_pair_deepmd_mpi_no_atom_map) verifies the with-comm artifact handles ghosts viaborder_opwithout consuming the mapping tensor.Known limitations
use_loc_mapping=Trueis permanently unsupported by this fix — the fail-fast surfaces it clearly, no path forward without re-export.comm_dict; deferred to a follow-up.MPI_Comm_sizeis not used as the multi-rank predicate becauseapi_ccdoes not link MPI directly;lmp_list.nswap > 0serves as the proxy (equivalent for all current LAMMPS configurations).use_loc_mapping=Truearchives lacking the new metadata field continue to exhibit the silent-corruption bug — users must regenerate.Summary by CodeRabbit
Bug Fixes
Tests