
fix(pt_expt): fail-fast on .pt2 GNN inference without LAMMPS atom-map#5450

Open
wanghan-iapcm wants to merge 1 commit into
deepmodeling:masterfrom
wanghan-iapcm:fix-pt-expt-lammps-no-atom-map

Collaborator

@wanghan-iapcm wanghan-iapcm commented May 21, 2026

Summary

  • Surface a previously-silent corruption / CUDA index assert in LAMMPS .pt2 inference for message-passing models (DPA2, DPA3, hybrids over those) when the LAMMPS atom-map is not enabled. Previously the C++ side fell into an identity-mapping fallback (DeepPotPTExpt.cc:374-384) whose values are wrong for ghost slots; the model's _exchange_ghosts (deepmd/dpmodel/descriptor/repformers.py) then performed take_along_axis(g1[1, nloc, dim], mapping_tiled) with out-of-bounds gather indices for ghosts — CUDA index assert in the user's DPA4 report, undefined CPU output otherwise.
  • Add a has_message_passing field to .pt2 metadata (mirrors the descriptor's has_message_passing() API: true for DPA2/DPA3/hybrids over those; false for se_e2_a/DPA1/etc.). Gate the fail-fast in DeepPotPTExpt::compute_inner and DeepSpinPTExpt::compute_inner on it. Non-GNN models retain their previous behaviour.
  • Two error messages target the two distinct unsupported configurations:
    • Single-rank without atom-map: "Single-rank LAMMPS .pt2 inference requires atom_modify map yes…"
    • Multi-rank without a with-comm artifact: "Multi-rank LAMMPS .pt2 inference requires the model to be exported with use_loc_mapping=False…"
  • Refined predicate: has_message_passing_ && !use_with_comm && !atom_map_present && nghost > 0. The nghost > 0 guard skips NoPbc and isolated-cluster cases where identity over [0, nloc) is trivially correct.
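The refined predicate can be read as a small decision function. The following is a minimal Python sketch for illustration only: the real check lives in the C++ `compute_inner` methods, the name `check_mapping_guard` and the `multi_rank` argument are hypothetical, and the message strings stop where the description above truncates them.

```python
from typing import Optional


def check_mapping_guard(
    has_message_passing: bool,
    use_with_comm: bool,
    atom_map_present: bool,
    nghost: int,
    multi_rank: bool,
) -> Optional[str]:
    """Return the error text if inference must fail fast, else None.

    Hypothetical helper mirroring the predicate quoted in the PR text;
    how the two messages are selected is an assumption of this sketch.
    """
    must_fail = (
        has_message_passing
        and not use_with_comm
        and not atom_map_present
        and nghost > 0  # no ghosts: identity over [0, nloc) is already correct
    )
    if not must_fail:
        return None
    if multi_rank:
        # Message text abbreviated; the full wording is elided in the PR.
        return (
            "Multi-rank LAMMPS .pt2 inference requires the model to be "
            "exported with use_loc_mapping=False"
        )
    return "Single-rank LAMMPS .pt2 inference requires atom_modify map yes"
```

Under these assumptions, a non-GNN model or a run with `nghost == 0` returns `None` and proceeds exactly as before.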

Four-cell coverage matrix in test_lammps_dpa3_pt2.py

| Cell | use_loc_mapping | atom-map | nprocs | Path | Test |
|---|---|---|---|---|---|
| A | True (regular only) | yes | 1 | regular w/ correct mapping | test_pair_deepmd (existing) |
| B | True | no | 1 | fail fast (single-rank msg) | test_pair_deepmd_no_atom_map_fails_fast (new) |
| B-mr | True | any | >1 | fail fast (multi-rank msg) | test_pair_deepmd_mpi_no_with_comm_fails_fast (new, subprocess) |
| C | False (regular + with-comm) | yes | 1 | regular w/ atom-map | test_pair_deepmd_with_comm (new) |
| C-mr | False | any | >1 | with-comm (border_op) | test_pair_deepmd_mpi_dpa3 (existing) |
| D | False | no | 1 | fail fast (single-rank PBC can't drive border_op) | test_pair_deepmd_with_comm_no_atom_map_fails_fast (new) |
| D-mr | False | no | >1 | with-comm (mapping-free) | test_pair_deepmd_mpi_no_atom_map (new, subprocess) |

Investigation note (resolves an earlier mystery)

test_deeppot_dpa_ptexpt.cc is misleadingly named — despite the Dpa prefix it loads deeppot_dpa1.pt2 (DPA1, non-message-passing). Its regular .pt2 graph never consumes mapping for ghost gather, so the identity fallback was trivially safe and the test passed without explicit inlist.mapping. The genuinely-DPA2 ctest is test_deeppot_dpa2_ptexpt.cc (different file), which already explicitly sets inlist.mapping = mapping.data(); on all cpu_lmp_nlist* paths. No C++ ctest fixtures need editing in this PR — the metadata-gated fail-fast correctly skips DPA1.

Backward compatibility

has_message_passing_ defaults to false in C++ when the metadata field is missing — so pre-PR .pt2 archives retain their previous behaviour. Non-GNN pre-PR archives continue to work; GNN pre-PR archives must be regenerated to opt into the fail-fast guard. In-tree fixtures are generated by gen_*.py at CI time, which always writes the new field.
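The default-to-false behaviour can be sketched in a few lines of Python. This assumes, purely for illustration, that the .pt2 metadata is available as a JSON object; the helper name `read_has_message_passing` and the `type_map` key in the usage lines are hypothetical, and the actual reader is the C++ `init` code.

```python
import json


def read_has_message_passing(metadata_json: str) -> bool:
    """Read the flag, treating a missing field as False (pre-PR archive)."""
    metadata = json.loads(metadata_json)
    # A missing key means the archive predates the field, so the guard stays off.
    return bool(metadata.get("has_message_passing", False))


# Hypothetical metadata blobs: one written before the PR, one after.
pre_pr_archive = json.dumps({"type_map": ["O", "H"]})
post_pr_archive = json.dumps({"type_map": ["O", "H"], "has_message_passing": True})

guard_enabled_old = read_has_message_passing(pre_pr_archive)
guard_enabled_new = read_has_message_passing(post_pr_archive)
```

The consequence stated above follows directly: a GNN archive written before the PR reads as `False`, so it only gains the guard after being regenerated.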

Test plan

  • Local C++ ctest *PtExpt* filter: 160 / 160 PASSED (270 s) against freshly-regenerated .pt2 fixtures.
  • CI runs the negative cells (B / B-mr / D), which exercise the throw and verify the error-message substrings. The pytest assertions use pytest.raises(Exception, match=r"atom_modify map yes") and a stdout/stderr substring check for use_loc_mapping=False; if LAMMPS wraps the exception with a different prefix/suffix than expected, the match may need adjustment.
  • CI cell D-mr (test_pair_deepmd_mpi_no_atom_map) verifies the with-comm artifact handles ghosts via border_op without consuming the mapping tensor.
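On the robustness of the substring assertions: `pytest.raises(..., match=...)` applies `re.search` to the string form of the exception, so text added before or after the message does not break the match, while any change inside the matched phrase does. A small standard-library sketch (the wrapped message and the helper name `message_matches` are made up for illustration):

```python
import re

PATTERN = r"atom_modify map yes"


def message_matches(message: str, pattern: str = PATTERN) -> bool:
    """Same search semantics that pytest.raises(match=...) relies on."""
    return re.search(pattern, message) is not None


# Hypothetical wrapping: a prefix and suffix around the C++ message.
wrapped = (
    "ERROR: Single-rank LAMMPS .pt2 inference requires "
    "atom_modify map yes (hypothetical suffix)"
)
# A reworded message, e.g. different spacing, would no longer match.
reworded = "Single-rank inference requires atom_modify  map yes"

wrapped_ok = message_matches(wrapped)
reworded_ok = message_matches(reworded)
```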

Known limitations

  • Multi-rank with use_loc_mapping=True remains unsupported after this fix: the fail-fast surfaces it clearly, and there is no path forward without re-export.
  • Single-rank PBC + with-comm artifact + no atom-map (cell D) could be made to work via a synthesized self-mirror comm_dict; deferred to a follow-up.
  • MPI_Comm_size is not used as the multi-rank predicate because api_cc does not link MPI directly; lmp_list.nswap > 0 serves as the proxy (equivalent for all current LAMMPS configurations).
  • The pre-PR DPA3 use_loc_mapping=True archives lacking the new metadata field continue to exhibit the silent-corruption bug — users must regenerate.

Summary by CodeRabbit

  • Bug Fixes

    • Added fail-fast validation for distributed computing scenarios to detect missing required atom mapping configurations earlier, with clear error messages for single-rank and multi-rank setups.
    • Improved model execution dispatch logic to correctly route message-passing models in distributed environments.
  • Tests

    • Extended test coverage for message-passing model variants and atom-mapping configurations across single-rank and multi-rank scenarios.


Single-rank LAMMPS .pt2 inference for a message-passing model (DPA2,
DPA3, hybrids over those) silently relied on LAMMPS atom-map to populate
``InputNlist.mapping`` — without ``atom_modify map yes`` the C++ side
fell into an identity-mapping fallback (``DeepPotPTExpt.cc:374-384``)
whose values are wrong for ghost slots ``[nloc, nall)``.  The model
graph's ``_exchange_ghosts`` (``deepmd/dpmodel/descriptor/repformers.py``)
then performed ``take_along_axis(g1[1, nloc, dim], mapping_tiled)`` with
out-of-bounds gather indices for ghosts, producing a CUDA index assert
(reproduced by the user on a DPA4 model) or undefined CPU output.
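The failure mode described above can be reproduced in miniature. The sketch below is a pure-Python stand-in for the gather, not the model code: `gather_rows` is a hypothetical helper, and the "correct" mapping is invented for a toy system with three local atoms and two ghosts.

```python
def gather_rows(local_features, mapping):
    """Stand-in for a gather along the atom axis: one row per mapping entry."""
    return [local_features[idx] for idx in mapping]


nloc, nghost = 3, 2
nall = nloc + nghost

# Features exist only for the nloc local atoms.
g1 = [[float(i), float(i)] for i in range(nloc)]

# Hypothetical correct mapping: the two ghosts are images of local atoms 0 and 2.
correct_mapping = [0, 1, 2, 0, 2]
# Identity fallback: slot i maps to i, so ghost slots point past the local range.
identity_mapping = list(range(nall))

extended = gather_rows(g1, correct_mapping)

try:
    gather_rows(g1, identity_mapping)
    identity_failed = False
except IndexError:
    # Python raises here; per the report above, the real gather instead hits
    # a CUDA index assert or yields undefined output on CPU.
    identity_failed = True
```

The identity values are fine for slots `[0, nloc)` and wrong only for `[nloc, nall)`, which is why the guard is skipped when `nghost == 0`.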

Multi-rank LAMMPS without a with-comm AOTI artifact has the same class
of failure: ``pair_deepmd.cpp:243`` only populates ``lmp_list.mapping``
for ``nprocs == 1``, so the regular path always misses the ghost mapping
under multi-rank.

Both unsupported combinations now fail-fast with an actionable message
instead of silently corrupting ghost features.

Files:

* deepmd/pt_expt/utils/serialization.py — emit ``has_message_passing``
  in .pt2 metadata, mirroring the descriptor's ``has_message_passing()``
  API (true for DPA2 / DPA3 / hybrids over those; false for se_e2_a /
  DPA1).
* source/api_cc/{include,src}/DeepPotPTExpt.{h,cc} and DeepSpinPTExpt
  — read the metadata into ``has_message_passing_``, gate the fail-fast
  on it so non-GNN models retain their previous behaviour.  Refined
  predicate:
    ``has_message_passing_ && !use_with_comm && !atom_map_present && nghost > 0``
  Two error messages: single-rank ("add atom_modify map yes") and
  multi-rank ("re-export with use_loc_mapping=False").  Defaults to
  ``false`` for pre-PR .pt2 archives that lack the field, so non-GNN
  archives continue to work; GNN archives must be regenerated to opt
  into the fail-fast guard.
* source/lmp/tests/run_mpi_pair_deepmd_dpa3_pt2.py — new
  ``--no-atom-map`` flag that omits ``atom_modify map yes`` from the
  LAMMPS input.
* source/lmp/tests/test_lammps_dpa3_pt2.py — four-cell coverage matrix:
    A   : test_pair_deepmd                              (existing)
    B   : test_pair_deepmd_no_atom_map_fails_fast       (new)
    C   : test_pair_deepmd_with_comm                    (new)
    D   : test_pair_deepmd_with_comm_no_atom_map_fails_fast  (new)
    C-mr: test_pair_deepmd_mpi_dpa3                     (existing)
    B-mr: test_pair_deepmd_mpi_no_with_comm_fails_fast  (new)
    D-mr: test_pair_deepmd_mpi_no_atom_map              (new)

Investigation note: the ``test_deeppot_dpa_ptexpt.cc`` C++ ctest is
misleadingly named — despite the "Dpa" prefix it loads
``deeppot_dpa1.pt2`` (DPA1, non-message-passing), so its regular .pt2
graph never consumes ``mapping`` for ghost gather and identity fallback
is trivially safe.  The genuinely-DPA2 ctest is
``test_deeppot_dpa2_ptexpt.cc``, which already explicitly sets
``inlist.mapping = mapping.data();`` on every ``cpu_lmp_nlist*`` case.
No C++ ctest fixtures need to be edited by this PR — the metadata-gated
fail-fast correctly skips non-message-passing models.

Local verification: 160/160 ptexpt ctests pass against freshly-regenerated
.pt2 fixtures (the new metadata field is written by all gen_*.py
scripts).  The fail-fast paths of the negative cells B/B-mr/D, and the
mapping-free with-comm path of cell D-mr, are exercised only via the
LAMMPS Python tests in CI.

Known limitations:
- Multi-rank DPA3 ``use_loc_mapping=True`` is permanently unsupported;
  the fail-fast surfaces this clearly.
- Single-rank with-comm-artifact + no atom-map (cell D) could be made
  to work by populating a synthetic self-mirror comm_dict; deferred.
- ``MPI_Comm_size`` is not used as the multi-rank predicate because
  api_cc does not link MPI; ``lmp_list.nswap > 0`` serves as the proxy
  (equivalent for all current LAMMPS configurations).
Contributor

coderabbitai Bot commented May 21, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between d3f08f3 and 9175247.

📒 Files selected for processing (7)
  • deepmd/pt_expt/utils/serialization.py
  • source/api_cc/include/DeepPotPTExpt.h
  • source/api_cc/include/DeepSpinPTExpt.h
  • source/api_cc/src/DeepPotPTExpt.cc
  • source/api_cc/src/DeepSpinPTExpt.cc
  • source/lmp/tests/run_mpi_pair_deepmd_dpa3_pt2.py
  • source/lmp/tests/test_lammps_dpa3_pt2.py

📝 Walkthrough

This PR extends the PT (PyTorch) model export and runtime to track message-passing descriptor behavior via a new has_message_passing metadata flag. The flag is exported during serialization, loaded by DeepPotPTExpt and DeepSpinPTExpt at runtime, and used to control artifact selection (with-comm vs regular) and enforce fail-fast guards when mapping tensors are required but unavailable.

Changes

Message Passing Metadata Export and Dispatch Guard

| Layer / File(s) | Summary |
|---|---|
| Metadata Export: deepmd/pt_expt/utils/serialization.py | _collect_metadata now computes and exports has_message_passing by querying the descriptor's has_message_passing() method, with defensive fallback to False for missing or unsupported methods. |
| Runtime Data Structures: source/api_cc/include/DeepPotPTExpt.h, source/api_cc/include/DeepSpinPTExpt.h | Both DeepPotPTExpt and DeepSpinPTExpt define a private has_message_passing_ boolean member (default false) to store the exported metadata flag at runtime. |
| DeepPotPTExpt Initialization and Dispatch Logic: source/api_cc/src/DeepPotPTExpt.cc | DeepPotPTExpt::init reads has_message_passing from metadata. In the LAMMPS compute path, dispatch logic moves earlier: multi_rank is derived from lmp_list.nswap, use_with_comm is selected based on artifact and multi-rank availability, and fail-fast exceptions are thrown with rank-specific messages when mapping is required but absent (with nghost > 0). Comments document safe fallback conditions, and inline use_with_comm recomputation is removed. |
| DeepSpinPTExpt Initialization and Dispatch Logic: source/api_cc/src/DeepSpinPTExpt.cc | DeepSpinPTExpt mirrors the same flow: metadata reading, early dispatch selection based on has_comm_artifact_ and multi-rank, fail-fast exceptions for missing with-comm + atom-map combinations, updated mapping fallback documentation, and reuse of earlier use_with_comm decision. |
| Test Infrastructure and Coverage: source/lmp/tests/run_mpi_pair_deepmd_dpa3_pt2.py, source/lmp/tests/test_lammps_dpa3_pt2.py | LAMMPS runner adds --no-atom-map CLI flag. Test module adds lammps_no_atom_map fixture, updates _lammps helper with atom_map parameter, extends _run_mpi_subprocess with optional pb_path and capture parameters, and introduces single-rank fail-fast tests and MPI tests validating fail-fast errors and baseline-matching behavior. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

Python, C++

Suggested reviewers

  • njzjz
  • anyangml
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding fail-fast error handling for .pt2 GNN inference when the LAMMPS atom-map is not enabled, which directly reflects the core fix across DeepPotPTExpt, DeepSpinPTExpt, and related test infrastructure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 917524709b

Comment on lines +392 to +393
if (has_message_passing_ && !use_with_comm && !atom_map_present &&
    nghost > 0) {
P1: Fail fast for multi-rank regular path regardless of atom-map presence

The new predicate only throws when !atom_map_present, so a multi-rank caller that does provide InputNlist.mapping can bypass this guard and still run the regular artifact (use_with_comm == false). In multi-rank runs, mapping lookups can validly resolve to ghost indices (>= nlocal), while the regular message-passing path gathers from local-only embeddings, which can still produce out-of-bounds indexing or corrupted forces. This leaves the original corruption class unblocked for a reachable configuration; the multi-rank fail-fast should not depend on whether a mapping pointer exists.
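One way to read this suggestion, sketched in Python rather than the C++ under review: keep the single-rank rule as is, but make the multi-rank branch independent of `atom_map_present`. The function name `should_fail_fast` and the `multi_rank` argument are hypothetical, and keeping the `nghost > 0` guard in the multi-rank branch is an assumption of this sketch, not something the comment specifies.

```python
def should_fail_fast(
    has_message_passing: bool,
    use_with_comm: bool,
    atom_map_present: bool,
    nghost: int,
    multi_rank: bool,
) -> bool:
    """Variant predicate: multi-rank regular path throws even with a mapping."""
    if not has_message_passing or use_with_comm or nghost <= 0:
        return False
    if multi_rank:
        # Regular artifact under multi-rank: a supplied mapping does not help.
        return True
    # Single-rank: only the missing atom-map case is fatal.
    return not atom_map_present


# The configuration the comment worries about: multi-rank, mapping supplied.
bypass_case = should_fail_fast(True, False, True, 4, multi_rank=True)
```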

# The C++ side gates its fail-fast on this — an absent mapping is
# fatal only for models that would silently corrupt ghost features
# otherwise.
desc = getattr(getattr(model, "atomic_model", None), "descriptor", None)
Contributor

Non-blocking: this currently derives has_message_passing only from model.atomic_model.descriptor.has_message_passing(). That is fine for the normal DPA2/DPA3 export path, but it may under-report for future/alternate wrappers where the top-level model (or atomic_model) exposes has_message_passing() without a directly exposed descriptor. Would it be safer to first try model.has_message_passing() / model.atomic_model.has_message_passing() and only then fall back to atomic_model.descriptor.has_message_passing()?

— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)
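The lookup order proposed in this comment could look roughly like the sketch below. It is an illustration, not the code in serialization.py: the helper name `detect_message_passing` is hypothetical, the stub objects stand in for real models, and treating `NotImplementedError` as "unsupported" is an assumption.

```python
from types import SimpleNamespace


def detect_message_passing(model) -> bool:
    """Try model, then atomic_model, then descriptor; default to False."""
    atomic_model = getattr(model, "atomic_model", None)
    descriptor = getattr(atomic_model, "descriptor", None)
    for obj in (model, atomic_model, descriptor):
        method = getattr(obj, "has_message_passing", None)
        if callable(method):
            try:
                return bool(method())
            except NotImplementedError:
                continue  # unsupported at this level: try the next one
    return False


# Stub objects for illustration only.
descriptor_only = SimpleNamespace(
    atomic_model=SimpleNamespace(
        descriptor=SimpleNamespace(has_message_passing=lambda: True)
    )
)
wrapper_only = SimpleNamespace(has_message_passing=lambda: True)
no_api = SimpleNamespace()

results = [
    detect_message_passing(descriptor_only),
    detect_message_passing(wrapper_only),
    detect_message_passing(no_api),
]
```

With this order, a wrapper that exposes `has_message_passing()` only at the top level is still detected, which is the under-reporting case the comment raises.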

// model graph fills ghost features via border_op and ignores this
// tensor for ghost gather — see deepmd/pt_expt/descriptor/
// repflows.py::_exchange_ghosts) or for trusted direct C++ callers
// (world == nullptr, see the dispatch carve-out above). Any other
Contributor

Non-blocking: this comment still mentions world == nullptr and a dispatch carve-out, but the fail-fast predicate above does not actually special-case world == nullptr; a direct C++ caller with nghost > 0, no mapping, and no with-comm path will now throw too. That behavior may be exactly what we want, but the comment should match it to avoid future misreads.

— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)

codecov Bot commented May 21, 2026

Codecov Report

❌ Patch coverage is 52.17391% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.75%. Comparing base (d3f08f3) to head (9175247).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| source/api_cc/src/DeepPotPTExpt.cc | 55.55% | 3 Missing and 1 partial ⚠️ |
| source/api_cc/src/DeepSpinPTExpt.cc | 42.85% | 4 Missing ⚠️ |
| deepmd/pt_expt/utils/serialization.py | 57.14% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5450      +/-   ##
==========================================
- Coverage   82.48%   80.75%   -1.73%     
==========================================
  Files         830      830              
  Lines       88522    88527       +5     
  Branches     4232     4231       -1     
==========================================
- Hits        73015    71489    -1526     
- Misses      14220    15913    +1693     
+ Partials     1287     1125     -162     

3 participants