-
Notifications
You must be signed in to change notification settings - Fork 612
fix(pt_expt): fail-fast on .pt2 GNN inference without LAMMPS atom-map #5450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -172,6 +172,18 @@ void DeepPotPTExpt::init(const std::string& model, | |
| // exchange and producing wrong results. | ||
| has_comm_artifact_ = metadata.obj_val.count("has_comm_artifact") && | ||
| metadata["has_comm_artifact"].as_bool(); | ||
| // Whether the regular .pt2 graph consumes ``mapping`` for ghost-atom | ||
| // feature gather. Mirrors the descriptor's ``has_message_passing()`` | ||
| // API: true for message-passing descriptors (DPA2, DPA3, hybrids | ||
| // over those), false for non-message-passing descriptors (se_e2_a, | ||
| // DPA1, etc.). Pre-PR .pt2 archives lack this field; default to | ||
| // false so they retain their previous behaviour (non-GNN archives | ||
| // continue to work; GNN archives that had the original | ||
| // silent-corruption bug must be regenerated to opt into the fail- | ||
| // fast guard). All in-tree fixtures are regenerated by the gen | ||
| // scripts and carry the explicit value. | ||
| has_message_passing_ = metadata.obj_val.count("has_message_passing") && | ||
| metadata["has_message_passing"].as_bool(); | ||
| if (has_comm_artifact_) { | ||
| try { | ||
| // Extract the nested ``extra/forward_lower_with_comm.pt2`` into a | ||
|
|
@@ -353,6 +365,49 @@ void DeepPotPTExpt::compute(ENERGYVTYPE& ener, | |
| .clone() | ||
| .to(device); | ||
|
|
||
| // Dispatch decision: use the with-comm artifact when LAMMPS is running | ||
| // multi-rank. ``lmp_list.nswap > 0`` is the proxy for "multi-rank with | ||
| // cross-domain communication"; in single-rank LAMMPS (processors 1 1 1, | ||
| // including with PBC) the C++ side sees nswap == 0. api_cc is not | ||
| // linked against MPI directly, so we cannot call MPI_Comm_size; the | ||
| // proxy is set by LAMMPS's CommBrick at setup time. | ||
| // | ||
| // The regular artifact uses ``mapping`` to gather ghost-atom features | ||
| // from local-atom embeddings (``index_select(node_ebd[1, nloc, dim], | ||
| // mapping)``). Identity-mapping for ghost slots is silently wrong, | ||
| // so fail-fast when the regular path would be taken without a real | ||
| // mapping — applies uniformly to every caller (LAMMPS pair, ctest | ||
| // fixtures, direct C++ API users). Callers that want the regular | ||
| // path must populate ``lmp_list.mapping``. | ||
| bool multi_rank = (lmp_list.nswap > 0); | ||
| bool atom_map_present = (lmp_list.mapping != nullptr); | ||
| bool use_with_comm = has_comm_artifact_ && multi_rank; | ||
| // Fail-fast conditions: | ||
| // - ``has_message_passing_``: only models whose regular graph | ||
| // actually consumes ``mapping`` for ghost-feature gather can be | ||
| // silently corrupted by an absent mapping. Skip for non-GNN | ||
| // models (se_e2_a, DPA1, ...). | ||
| // - ``nghost > 0``: with no ghost atoms, identity mapping over | ||
| // [0, nloc) is trivially correct. | ||
| if (has_message_passing_ && !use_with_comm && !atom_map_present && | ||
| nghost > 0) { | ||
|
Comment on lines
+392
to
+393
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The new predicate only throws when Useful? React with 👍 / 👎. |
||
| if (multi_rank) { | ||
| throw deepmd::deepmd_exception( | ||
| "Multi-rank LAMMPS .pt2 inference requires the model to be " | ||
| "exported with `use_loc_mapping=False`, which compiles a " | ||
| "with-comm artifact for cross-rank ghost-feature exchange. " | ||
| "Re-export the model with use_loc_mapping=False and try again."); | ||
| } else { | ||
| throw deepmd::deepmd_exception( | ||
| "Single-rank LAMMPS .pt2 inference requires `atom_modify map " | ||
| "yes` in the LAMMPS input (so InputNlist.mapping is populated " | ||
| "from the LAMMPS atom-map). The model gathers ghost-atom " | ||
| "features via this mapping; without it the C++ side has no " | ||
| "safe way to resolve ghost indices to local owners. C++ API " | ||
| "callers must set inlist.mapping explicitly before compute()."); | ||
| } | ||
| } | ||
|
|
||
| // LAMMPS sets ago=0 on every nlist rebuild (neighbor rebuild, re-partition, | ||
| // atom exchange between subdomains), so `ago > 0` implies the cached | ||
| // mapping and nlist tensors are still valid. Rebuild only on ago==0. | ||
|
|
@@ -372,7 +427,13 @@ void DeepPotPTExpt::compute(ENERGYVTYPE& ener, | |
| .clone() | ||
| .to(device); | ||
| } else { | ||
| // Default identity mapping for local atoms | ||
| // Identity fallback reached only on the with-comm path (where the | ||
| // model graph fills ghost features via border_op and ignores this | ||
| // tensor for ghost gather — see deepmd/pt_expt/descriptor/ | ||
| // repflows.py::_exchange_ghosts) or for trusted direct C++ callers | ||
| // (world == nullptr, see the dispatch carve-out above). Any other | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Non-blocking: this comment still mentions — OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5) |
||
| // path that reaches here would have been rejected by the fail-fast | ||
| // throw, so identity values are safe. | ||
| std::vector<std::int64_t> mapping(nall_real); | ||
| for (int ii = 0; ii < nall_real; ii++) { | ||
| mapping[ii] = ii; | ||
|
|
@@ -428,14 +489,11 @@ void DeepPotPTExpt::compute(ENERGYVTYPE& ener, | |
| aparam_tensor = torch::zeros({0}, options).to(device); | ||
| } | ||
|
|
||
| // Phase 4 dispatch: use the with-comm artifact when LAMMPS is | ||
| // running multi-rank. ``lmp_list.nswap > 0`` is the proxy for | ||
| // "multi-rank with cross-domain communication"; in single-rank | ||
| // mode LAMMPS sets nswap=0. Falling back to the regular artifact | ||
| // for nswap=0 is correct because that artifact uses the mapping | ||
| // tensor to gather ghost embeddings from local atoms. | ||
| // ``use_with_comm`` was computed earlier alongside the fail-fast | ||
| // dispatch check. Use the with-comm artifact for the multi-rank case | ||
| // (the regular artifact uses the mapping tensor to gather ghost | ||
| // embeddings, which only works in single-rank). | ||
| std::vector<torch::Tensor> flat_outputs; | ||
| bool use_with_comm = has_comm_artifact_ && lmp_list.nswap > 0; | ||
| if (use_with_comm && !with_comm_loader) { | ||
| throw deepmd::deepmd_exception( | ||
| "Multi-rank LAMMPS requires the with-comm artifact, but it failed " | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocking: this currently derives
has_message_passingonly frommodel.atomic_model.descriptor.has_message_passing(). That is fine for the normal DPA2/DPA3 export path, but it may under-report for future/alternate wrappers where the top-level model (or atomic_model) exposeshas_message_passing()without a directly exposeddescriptor. Would it be safer to first trymodel.has_message_passing()/model.atomic_model.has_message_passing()and only then fall back toatomic_model.descriptor.has_message_passing()?— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)