fix(pt_expt): fail-fast on .pt2 GNN inference without LAMMPS atom-map by wanghan-iapcm · Pull Request #5450 · deepmodeling/deepmd-kit

wanghan-iapcm · 2026-05-21T08:59:08Z

Summary

Surface a previously-silent corruption / CUDA index assert in LAMMPS .pt2 inference for message-passing models (DPA2, DPA3, hybrids over those) when the LAMMPS atom-map is not enabled. Previously the C++ side fell into an identity-mapping fallback (DeepPotPTExpt.cc:374-384) whose values are wrong for ghost slots; the model's _exchange_ghosts (deepmd/dpmodel/descriptor/repformers.py) then performed take_along_axis(g1[1, nloc, dim], mapping_tiled) with out-of-bounds gather indices for ghosts — CUDA index assert in the user's DPA4 report, undefined CPU output otherwise.
Add a has_message_passing field to .pt2 metadata (mirrors the descriptor's has_message_passing() API: true for DPA2/DPA3/hybrids over those; false for se_e2_a/DPA1/etc.). Gate the fail-fast in DeepPotPTExpt::compute_inner and DeepSpinPTExpt::compute_inner on it. Non-GNN models retain their previous behaviour.
Two error messages target the two distinct unsupported configurations:
- Single-rank without atom-map: "Single-rank LAMMPS .pt2 inference requires atom_modify map yes…"
- Multi-rank without a with-comm artifact: "Multi-rank LAMMPS .pt2 inference requires the model to be exported with use_loc_mapping=False…"
Refined predicate: has_message_passing_ && !use_with_comm && !atom_map_present && nghost > 0. The nghost > 0 guard skips NoPbc and isolated-cluster cases where identity over [0, nloc) is trivially correct.
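
The failure mode described above can be illustrated with a toy sketch. This is not the deepmd-kit code: plain Python lists stand in for the `g1[1, nloc, dim]` tensor, and `gather_rows`, the sizes, and the mapping values are all assumptions made for illustration.

```python
# Toy illustration only: plain lists stand in for the g1[1, nloc, dim] tensor.
nloc, nghost = 3, 2
nall = nloc + nghost

# One feature row per *local* atom; ghost slots have no row of their own.
g1 = [[float(i)] * 2 for i in range(nloc)]


def gather_rows(features, mapping):
    """Return features[mapping[i]] for every slot (a gather along the atom axis)."""
    return [features[j] for j in mapping]


# A correct mapping (what an atom map would provide): each ghost slot
# points at the local atom it is an image of. Values are hypothetical.
atom_map_mapping = [0, 1, 2, 0, 2]

# Identity fallback: slot i maps to i, which is only valid for i < nloc.
identity_mapping = list(range(nall))

extended = gather_rows(g1, atom_map_mapping)  # fine: every index is < nloc

try:
    gather_rows(g1, identity_mapping)  # indices 3 and 4 exceed nloc - 1
    identity_ok = True
except IndexError:
    identity_ok = False

# With no ghosts, identity over [0, nloc) is trivially correct.
no_ghost_ok = gather_rows(g1, list(range(nloc))) == g1
```

Python raises `IndexError` here; per the description above, the tensor gather instead surfaces as a CUDA index assert or as undefined output on CPU. The last line shows why the `nghost > 0` guard can safely skip the no-ghost case.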

Four-cell coverage matrix in `test_lammps_dpa3_pt2.py`

| Cell | `use_loc_mapping` | atom-map | nprocs | Path | Test |
|---|---|---|---|---|---|
| A | True (regular only) | yes | 1 | regular w/ correct mapping | `test_pair_deepmd` (existing) |
| B | True | no | 1 | fail fast (single-rank msg) | `test_pair_deepmd_no_atom_map_fails_fast` (new) |
| B-mr | True | any | >1 | fail fast (multi-rank msg) | `test_pair_deepmd_mpi_no_with_comm_fails_fast` (new, subprocess) |
| C | False (regular + with-comm) | yes | 1 | regular w/ atom-map | `test_pair_deepmd_with_comm` (new) |
| C-mr | False | any | >1 | with-comm (`border_op`) | `test_pair_deepmd_mpi_dpa3` (existing) |
| D | False | no | 1 | fail fast (single-rank PBC can't drive border_op) | `test_pair_deepmd_with_comm_no_atom_map_fails_fast` (new) |
| D-mr | False | no | >1 | with-comm (mapping-free) | `test_pair_deepmd_mpi_no_atom_map` (new, subprocess) |
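
As a cross-check of the matrix against the refined predicate, here is a minimal Python sketch of the dispatch decision. It is a model of the logic, not the C++ implementation: `fail_fast_message` is a hypothetical helper, the rule `use_with_comm = has_comm_artifact and multi_rank` is an assumption inferred from the description, the message constants contain only the visible prefixes of the real messages, and the multi-rank cells assume no mapping reaches the C++ side (the commit message states that `pair_deepmd.cpp` only populates it for `nprocs == 1`).

```python
from typing import Optional

# Only the prefixes quoted in the PR description; the full text is elided there.
SINGLE_RANK_MSG = "Single-rank LAMMPS .pt2 inference requires atom_modify map yes"
MULTI_RANK_MSG = (
    "Multi-rank LAMMPS .pt2 inference requires the model to be exported "
    "with use_loc_mapping=False"
)


def fail_fast_message(
    has_message_passing: bool,
    has_comm_artifact: bool,
    multi_rank: bool,
    atom_map_present: bool,
    nghost: int,
) -> Optional[str]:
    """Return the message the guard would raise, or None if inference proceeds."""
    # Assumption: the with-comm artifact is used only when it exists
    # and the run is multi-rank.
    use_with_comm = has_comm_artifact and multi_rank
    if (
        has_message_passing
        and not use_with_comm
        and not atom_map_present
        and nghost > 0
    ):
        return MULTI_RANK_MSG if multi_rank else SINGLE_RANK_MSG
    return None


# Cells as (has_comm_artifact, multi_rank, atom_map_present).
# use_loc_mapping=True exports the regular artifact only, so has_comm_artifact=False.
cells = {
    "A": (False, False, True),
    "B": (False, False, False),
    "B-mr": (False, True, False),
    "C": (True, False, True),
    "C-mr": (True, True, False),
    "D": (True, False, False),
    "D-mr": (True, True, False),
}

# Evaluate every cell for a message-passing model with some ghost atoms.
outcome = {
    name: fail_fast_message(True, artifact, multi, atom_map, nghost=8)
    for name, (artifact, multi, atom_map) in cells.items()
}
```

Under these assumptions cells B and D yield the single-rank message, B-mr yields the multi-rank message, and the remaining cells return `None`. A non-message-passing model returns `None` for every combination.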

Investigation note (resolves an earlier mystery)

test_deeppot_dpa_ptexpt.cc is misleadingly named — despite the Dpa prefix it loads deeppot_dpa1.pt2 (DPA1, non-message-passing). Its regular .pt2 graph never consumes mapping for ghost gather, so the identity fallback was trivially safe and the test passed without explicit inlist.mapping. The genuinely-DPA2 ctest is test_deeppot_dpa2_ptexpt.cc (different file), which already explicitly sets inlist.mapping = mapping.data(); on all cpu_lmp_nlist* paths. No C++ ctest fixtures need editing in this PR — the metadata-gated fail-fast correctly skips DPA1.

Backward compatibility

has_message_passing_ defaults to false in C++ when the metadata field is missing — so pre-PR .pt2 archives retain their previous behaviour. Non-GNN pre-PR archives continue to work; GNN pre-PR archives must be regenerated to opt into the fail-fast guard. In-tree fixtures are generated by gen_*.py at CI time, which always writes the new field.
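
The default-to-false behaviour can be sketched in Python as follows. The real read happens in C++; `read_has_message_passing` is a hypothetical helper, and every metadata key other than `has_message_passing` is invented for illustration.

```python
import json


def read_has_message_passing(metadata_json: str) -> bool:
    """Return the flag, treating a missing field as False (pre-PR archives)."""
    metadata = json.loads(metadata_json)
    return bool(metadata.get("has_message_passing", False))


# Hypothetical metadata payloads; only has_message_passing comes from the PR.
pre_pr_archive = json.dumps({"type_map": ["O", "H"]})
new_gnn_archive = json.dumps({"type_map": ["O", "H"], "has_message_passing": True})
new_dpa1_archive = json.dumps({"type_map": ["O", "H"], "has_message_passing": False})

flags = [
    read_has_message_passing(pre_pr_archive),
    read_has_message_passing(new_gnn_archive),
    read_has_message_passing(new_dpa1_archive),
]
```

A pre-PR archive is indistinguishable from a non-GNN one under this rule, which is why a pre-PR GNN archive skips the guard until it is regenerated.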

Test plan

Local C++ ctest *PtExpt* filter: 160 / 160 PASSED (270 s) against freshly-regenerated .pt2 fixtures.
CI runs the negative cells (B / B-mr / D), which exercise the throw and verify the error-message substrings. The pytest assertions use pytest.raises(Exception, match=r"atom_modify map yes") and check stdout/stderr for the substring use_loc_mapping=False; if LAMMPS wraps the exception with a prefix/suffix differently than expected, the match may need adjustment.
CI cell D-mr (test_pair_deepmd_mpi_no_atom_map) verifies the with-comm artifact handles ghosts via border_op without consuming the mapping tensor.
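
For context on how sensitive the substring checks are: `pytest.raises(..., match=...)` applies `re.search` to the string form of the exception, so extra text before or after the substring is tolerated. The sketch below uses only `re`; both message strings are hypothetical and are not actual LAMMPS output.

```python
import re

PATTERN = r"atom_modify map yes"

# Hypothetical wrapped message: extra prefix and suffix around the substring.
wrapped = (
    "ERROR on proc 0: Single-rank LAMMPS .pt2 inference requires "
    "atom_modify map yes (hypothetical suffix)"
)

# Hypothetical rewording that changes the substring itself.
reworded = "Single-rank inference requires atom_modify map = yes"

wrapped_matches = re.search(PATTERN, wrapped) is not None
reworded_matches = re.search(PATTERN, reworded) is not None
```

Here `wrapped_matches` is true and `reworded_matches` is false, so the pattern would likely need adjusting only if the substring itself changes or if the error does not reach Python as an exception.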

Known limitations

Multi-rank with use_loc_mapping=True is permanently unsupported by this fix; the fail-fast surfaces it clearly, and the only path forward is to re-export.
Single-rank PBC + with-comm artifact + no atom-map (cell D) could be made to work via a synthesized self-mirror comm_dict; deferred to a follow-up.
MPI_Comm_size is not used as the multi-rank predicate because api_cc does not link MPI directly; lmp_list.nswap > 0 serves as the proxy (equivalent for all current LAMMPS configurations).
The pre-PR DPA3 use_loc_mapping=True archives lacking the new metadata field continue to exhibit the silent-corruption bug — users must regenerate.

Summary by CodeRabbit

Bug Fixes
- Added fail-fast validation for distributed computing scenarios to detect missing required atom mapping configurations earlier, with clear error messages for single-rank and multi-rank setups.
- Improved model execution dispatch logic to correctly route message-passing models in distributed environments.
Tests
- Extended test coverage for message-passing model variants and atom-mapping configurations across single-rank and multi-rank scenarios.

Single-rank LAMMPS .pt2 inference for a message-passing model (DPA2, DPA3, hybrids over those) silently relied on LAMMPS atom-map to populate ``InputNlist.mapping``: without ``atom_modify map yes`` the C++ side fell into an identity-mapping fallback (``DeepPotPTExpt.cc:374-384``) whose values are wrong for ghost slots ``[nloc, nall)``. The model graph's ``_exchange_ghosts`` (``deepmd/dpmodel/descriptor/repformers.py``) then performed ``take_along_axis(g1[1, nloc, dim], mapping_tiled)`` with out-of-bounds gather indices for ghosts, producing a CUDA index assert (reproduced by the user on a DPA4 model) or undefined CPU output.

Multi-rank LAMMPS without a with-comm AOTI artifact has the same class of failure: ``pair_deepmd.cpp:243`` only populates ``lmp_list.mapping`` for ``nprocs == 1``, so the regular path always misses the ghost mapping under multi-rank.

Both unsupported combinations now fail-fast with an actionable message instead of silently corrupting ghost features.

Files:

* deepmd/pt_expt/utils/serialization.py: emit ``has_message_passing`` in .pt2 metadata, mirroring the descriptor's ``has_message_passing()`` API (true for DPA2 / DPA3 / hybrids over those; false for se_e2_a / DPA1).
* source/api_cc/{include,src}/DeepPotPTExpt.{h,cc} and DeepSpinPTExpt: read the metadata into ``has_message_passing_``, gate the fail-fast on it so non-GNN models retain their previous behaviour. Refined predicate: ``has_message_passing_ && !use_with_comm && !atom_map_present && nghost > 0``. Two error messages: single-rank ("add atom_modify map yes") and multi-rank ("re-export with use_loc_mapping=False"). Defaults to ``false`` for pre-PR .pt2 archives that lack the field, so non-GNN archives continue to work; GNN archives must be regenerated to opt into the fail-fast guard.
* source/lmp/tests/run_mpi_pair_deepmd_dpa3_pt2.py: new ``--no-atom-map`` flag that omits ``atom_modify map yes`` from the LAMMPS input.
* source/lmp/tests/test_lammps_dpa3_pt2.py: four-cell coverage matrix:
  - A: test_pair_deepmd (existing)
  - B: test_pair_deepmd_no_atom_map_fails_fast (new)
  - C: test_pair_deepmd_with_comm (new)
  - D: test_pair_deepmd_with_comm_no_atom_map_fails_fast (new)
  - C-mr: test_pair_deepmd_mpi_dpa3 (existing)
  - B-mr: test_pair_deepmd_mpi_no_with_comm_fails_fast (new)
  - D-mr: test_pair_deepmd_mpi_no_atom_map (new)

Investigation note: the ``test_deeppot_dpa_ptexpt.cc`` C++ ctest is misleadingly named. Despite the "Dpa" prefix it loads ``deeppot_dpa1.pt2`` (DPA1, non-message-passing), so its regular .pt2 graph never consumes ``mapping`` for ghost gather and identity fallback is trivially safe. The genuinely-DPA2 ctest is ``test_deeppot_dpa2_ptexpt.cc``, which already explicitly sets ``inlist.mapping = mapping.data();`` on every ``cpu_lmp_nlist*`` case. No C++ ctest fixtures need to be edited by this PR: the metadata-gated fail-fast correctly skips non-message-passing models.

Local verification: 160/160 ptexpt ctests pass against freshly-regenerated .pt2 fixtures (the new metadata field is written by all gen_*.py scripts). The negative cells B/B-mr/D/D-mr fail-fast paths are exercised only via the LAMMPS Python tests in CI.

Known limitations:

- Multi-rank DPA3 ``use_loc_mapping=True`` is permanently unsupported; the fail-fast surfaces this clearly.
- Single-rank with-comm-artifact + no atom-map (cell D) could be made to work by populating a synthetic self-mirror comm_dict; deferred.
- ``MPI_Comm_size`` is not used as the multi-rank predicate because api_cc does not link MPI; ``lmp_list.nswap > 0`` serves as the proxy (equivalent for all current LAMMPS configurations).

coderabbitai · 2026-05-21T09:02:48Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c809ab41-dee8-4097-b137-53b7d4b727ea

📥 Commits

Reviewing files that changed from the base of the PR and between d3f08f3 and 9175247.

📒 Files selected for processing (7)

deepmd/pt_expt/utils/serialization.py
source/api_cc/include/DeepPotPTExpt.h
source/api_cc/include/DeepSpinPTExpt.h
source/api_cc/src/DeepPotPTExpt.cc
source/api_cc/src/DeepSpinPTExpt.cc
source/lmp/tests/run_mpi_pair_deepmd_dpa3_pt2.py
source/lmp/tests/test_lammps_dpa3_pt2.py

📝 Walkthrough

This PR extends the PT (PyTorch) model export and runtime to track message-passing descriptor behavior via a new has_message_passing metadata flag. The flag is exported during serialization, loaded by DeepPotPTExpt and DeepSpinPTExpt at runtime, and used to control artifact selection (with-comm vs regular) and enforce fail-fast guards when mapping tensors are required but unavailable.

Changes

Message Passing Metadata Export and Dispatch Guard

| Layer / File(s) | Summary |
|---|---|
| Metadata Export `deepmd/pt_expt/utils/serialization.py` | `_collect_metadata` now computes and exports `has_message_passing` by querying the descriptor's `has_message_passing()` method, with defensive fallback to `False` for missing or unsupported methods. |
| Runtime Data Structures `source/api_cc/include/DeepPotPTExpt.h`, `source/api_cc/include/DeepSpinPTExpt.h` | Both DeepPotPTExpt and DeepSpinPTExpt define a private `has_message_passing_` boolean member (default `false`) to store the exported metadata flag at runtime. |
| DeepPotPTExpt Initialization and Dispatch Logic `source/api_cc/src/DeepPotPTExpt.cc` | DeepPotPTExpt::init reads `has_message_passing` from metadata. In the LAMMPS compute path, dispatch logic moves earlier: `multi_rank` is derived from `lmp_list.nswap`, `use_with_comm` is selected based on artifact and multi-rank availability, and fail-fast exceptions are thrown with rank-specific messages when mapping is required but absent (with `nghost > 0`). Comments document safe fallback conditions, and inline `use_with_comm` recomputation is removed. |
| DeepSpinPTExpt Initialization and Dispatch Logic `source/api_cc/src/DeepSpinPTExpt.cc` | DeepSpinPTExpt mirrors the same flow: metadata reading, early dispatch selection based on `has_comm_artifact_` and multi-rank, fail-fast exceptions for missing with-comm + atom-map combinations, updated mapping fallback documentation, and reuse of earlier `use_with_comm` decision. |
| Test Infrastructure and Coverage `source/lmp/tests/run_mpi_pair_deepmd_dpa3_pt2.py`, `source/lmp/tests/test_lammps_dpa3_pt2.py` | LAMMPS runner adds `--no-atom-map` CLI flag. Test module adds `lammps_no_atom_map` fixture, updates `_lammps` helper with `atom_map` parameter, extends `_run_mpi_subprocess` with optional `pb_path` and `capture` parameters, and introduces single-rank fail-fast tests and MPI tests validating fail-fast errors and baseline-matching behavior. |
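
The `--no-atom-map` runner flag summarized in the table could look roughly like the sketch below. Only the flag name and the `atom_modify map yes` command come from the PR; `build_parser`, `setup_commands`, and the other LAMMPS commands shown are assumptions, and the real runner script may be organized differently.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI for a LAMMPS MPI test runner."""
    parser = argparse.ArgumentParser(description="LAMMPS MPI test runner (sketch)")
    parser.add_argument(
        "--no-atom-map",
        action="store_true",
        help="omit 'atom_modify map yes' from the LAMMPS input",
    )
    return parser


def setup_commands(no_atom_map: bool) -> List[str]:
    """Return the LAMMPS setup commands, with or without the atom map."""
    commands = ["units metal", "boundary p p p", "atom_style atomic"]
    if not no_atom_map:
        # Default path: enable the atom map so the mapping can be populated.
        commands.append("atom_modify map yes")
    return commands


args = build_parser().parse_args(["--no-atom-map"])
commands_without_map = setup_commands(args.no_atom_map)
commands_with_map = setup_commands(False)
```

Keeping the atom map on by default and making its removal opt-in means existing invocations of the runner are unaffected.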

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

deepmodeling/deepmd-kit#5354 (refactor(pt_expt): use model API for inference, consistent file naming): Both PRs modify the PT archive metadata.json serialization flow at the code level. #5354 refactors pt_expt/utils/serialization.py and the C++ side to use extra/metadata.json, while this PR extends that metadata with a new has_message_passing flag consumed by the DeepPotPTExpt and DeepSpinPTExpt dispatch logic.

Suggested labels

Python, C++

Suggested reviewers

njzjz
anyangml

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding fail-fast error handling for .pt2 GNN inference when the LAMMPS atom-map is not enabled, which directly reflects the core fix across DeepPotPTExpt, DeepSpinPTExpt, and related test infrastructure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 917524709b

chatgpt-codex-connector · 2026-05-21T09:08:15Z

+  if (has_message_passing_ && !use_with_comm && !atom_map_present &&
+      nghost > 0) {


Fail fast for multi-rank regular path regardless of atom-map presence

The new predicate only throws when !atom_map_present, so a multi-rank caller that does provide InputNlist.mapping can bypass this guard and still run the regular artifact (use_with_comm == false). In multi-rank runs, mapping lookups can validly resolve to ghost indices (>= nlocal), while the regular message-passing path gathers from local-only embeddings, which can still produce out-of-bounds indexing or corrupted forces. This leaves the original corruption class unblocked for a reachable configuration; the multi-rank fail-fast should not depend on whether a mapping pointer exists.
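
The configuration this comment describes can be checked with a small Python sketch. `current_guard` transcribes the predicate quoted in the diff; `rank_independent_guard` is only one possible reading of the suggestion, not code from the PR, and the scenario values are hypothetical.

```python
def current_guard(has_mp, use_with_comm, atom_map_present, nghost):
    """Predicate as quoted in the diff: True means the call throws."""
    return has_mp and not use_with_comm and not atom_map_present and nghost > 0


def rank_independent_guard(has_mp, use_with_comm, atom_map_present, nghost, multi_rank):
    """Hypothetical variant: under multi-rank, ignore whether a mapping exists."""
    needs_ghost_gather = has_mp and not use_with_comm and nghost > 0
    return needs_ghost_gather and (multi_rank or not atom_map_present)


# Multi-rank caller, regular artifact, mapping pointer supplied, ghosts present.
scenario = dict(has_mp=True, use_with_comm=False, atom_map_present=True, nghost=8)

throws_today = current_guard(**scenario)
throws_with_variant = rank_independent_guard(**scenario, multi_rank=True)

# Single-rank run with a valid atom map must keep working under either guard.
single_rank_ok = not rank_independent_guard(**scenario, multi_rank=False)
```

In this sketch the scenario passes the current predicate without throwing, while the variant rejects it. Whether that scenario is reachable from LAMMPS depends on whether a mapping is ever supplied under multi-rank, which the sketch does not model.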

njzjz-bot · 2026-05-21T09:28:33Z

+    # The C++ side gates its fail-fast on this — an absent mapping is
+    # fatal only for models that would silently corrupt ghost features
+    # otherwise.
+    desc = getattr(getattr(model, "atomic_model", None), "descriptor", None)


Non-blocking: this currently derives has_message_passing only from model.atomic_model.descriptor.has_message_passing(). That is fine for the normal DPA2/DPA3 export path, but it may under-report for future/alternate wrappers where the top-level model (or atomic_model) exposes has_message_passing() without a directly exposed descriptor. Would it be safer to first try model.has_message_passing() / model.atomic_model.has_message_passing() and only then fall back to atomic_model.descriptor.has_message_passing()?

— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)
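
A minimal Python sketch of the lookup order suggested in the comment above. `detect_message_passing` and the dummy classes are hypothetical and are not the code in `serialization.py`; catching `NotImplementedError` is an assumption about how an unsupported method might signal itself.

```python
def detect_message_passing(model) -> bool:
    """Probe model, then atomic_model, then descriptor for has_message_passing()."""
    atomic_model = getattr(model, "atomic_model", None)
    descriptor = getattr(atomic_model, "descriptor", None)
    for candidate in (model, atomic_model, descriptor):
        method = getattr(candidate, "has_message_passing", None)
        if callable(method):
            try:
                return bool(method())
            except NotImplementedError:
                continue  # fall through to the next, more specific object
    return False  # defensive default when nothing answers


# Hypothetical stand-ins for the shapes a model object might take.
class Descriptor:
    def has_message_passing(self):
        return True


class AtomicModel:
    descriptor = Descriptor()


class StandardModel:
    atomic_model = AtomicModel()


class WrapperModel:
    """Exposes the method at the top level, with no descriptor attribute."""

    def has_message_passing(self):
        return True


class PlainModel:
    pass


results = [
    detect_message_passing(StandardModel()),
    detect_message_passing(WrapperModel()),
    detect_message_passing(PlainModel()),
]
```

The wrapper case is the one the comment is concerned about: a descriptor-only lookup would report `False` for it.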

njzjz-bot · 2026-05-21T09:28:35Z

+      // model graph fills ghost features via border_op and ignores this
+      // tensor for ghost gather — see deepmd/pt_expt/descriptor/
+      // repflows.py::_exchange_ghosts) or for trusted direct C++ callers
+      // (world == nullptr, see the dispatch carve-out above).  Any other


Non-blocking: this comment still mentions world == nullptr and a dispatch carve-out, but the fail-fast predicate above does not actually special-case world == nullptr; a direct C++ caller with nghost > 0, no mapping, and no with-comm path will now throw too. That behavior may be exactly what we want, but the comment should match it to avoid future misreads.

— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)

codecov · 2026-05-21T09:59:20Z

Codecov Report

❌ Patch coverage is 52.17391% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.75%. Comparing base (d3f08f3) to head (9175247).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| source/api_cc/src/DeepPotPTExpt.cc | 55.55% | 3 Missing and 1 partial ⚠️ |
| source/api_cc/src/DeepSpinPTExpt.cc | 42.85% | 4 Missing ⚠️ |
| deepmd/pt_expt/utils/serialization.py | 57.14% | 3 Missing ⚠️ |
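
The reported percentages are mutually consistent, as this arithmetic sketch shows. The per-file line totals (9, 7, 7) are not in the report: they are inferred here as the totals that reproduce the listed percentages, assuming partial lines count as not covered and per-file values are truncated to two decimals.

```python
# (missing, partial) per file, copied from the table above.
uncovered = {
    "source/api_cc/src/DeepPotPTExpt.cc": (3, 1),
    "source/api_cc/src/DeepSpinPTExpt.cc": (4, 0),
    "deepmd/pt_expt/utils/serialization.py": (3, 0),
}

# Inferred, not reported: changed-line totals that reproduce the percentages.
inferred_total = {
    "source/api_cc/src/DeepPotPTExpt.cc": 9,
    "source/api_cc/src/DeepSpinPTExpt.cc": 7,
    "deepmd/pt_expt/utils/serialization.py": 7,
}


def truncated_pct(hits: int, total: int) -> float:
    """Percentage truncated (not rounded) to two decimals."""
    return hits * 10000 // total / 100


per_file = {}
total_hits = 0
total_lines = 0
for path, (missing, partial) in uncovered.items():
    lines = inferred_total[path]
    hits = lines - missing - partial  # partial lines treated as not covered
    per_file[path] = truncated_pct(hits, lines)
    total_hits += hits
    total_lines += lines

patch_coverage = round(100 * total_hits / total_lines, 5)
lines_not_covered = total_lines - total_hits
```

With these inferred totals, 12 of 23 changed lines are hit, which gives the reported 52.17391% patch coverage and 11 uncovered lines.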

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5450      +/-   ##
==========================================
- Coverage   82.48%   80.75%   -1.73%     
==========================================
  Files         830      830              
  Lines       88522    88527       +5     
  Branches     4232     4231       -1     
==========================================
- Hits        73015    71489    -1526     
- Misses      14220    15913    +1693     
+ Partials     1287     1125     -162

dosubot Bot added the bug label May 21, 2026

github-actions Bot added Python C++ LAMMPS labels May 21, 2026

wanghan-iapcm requested review from OutisLi and njzjz May 21, 2026 09:00

OutisLi approved these changes May 21, 2026

chatgpt-codex-connector Bot reviewed May 21, 2026

njzjz-bot reviewed May 21, 2026

wanghan-iapcm wants to merge 1 commit into deepmodeling:master from wanghan-iapcm:fix-pt-expt-lammps-no-atom-map
