fix: fake border op import #5451
📝 Walkthrough
This PR defers registration of
Changes
Deferred comm operator registration
Actionable comments posted: 2
📒 Files selected for processing (4)
- deepmd/pt_expt/utils/__init__.py
- deepmd/pt_expt/utils/comm.py
- deepmd/pt_expt/utils/serialization.py
- source/tests/pt_expt/conftest.py
    global _registered
    if _registered:
        return
    _check_underlying_ops_loaded()
    try:
        torch.library.register_fake("deepmd_export::border_op")(_border_op_fake)
    except RuntimeError as e:
        if "already has" not in str(e) and "already registered" not in str(e):
            raise
    try:
        torch.library.register_fake("deepmd_export::border_op_backward")(
            _border_op_backward_fake
        )
    except RuntimeError as e:
        if "already has" not in str(e) and "already registered" not in str(e):
            raise
    torch.library.register_autograd(
        "deepmd_export::border_op",
        _border_op_backward,
        setup_context=_border_op_setup_context,
    )
    _registered = True
Make the lazy-registration guard atomic.
ensure_comm_registered() is documented as idempotent, but two callers can both pass Line 193 before _registered flips and then race through the global registration path. Please protect the block with a module-level lock and re-check _registered inside it.
Suggested fix
    +import threading
    +
     import torch
     _registered: bool = False
    +_register_lock = threading.Lock()
     ...
     def ensure_comm_registered() -> None:
         ...
         global _registered
         if _registered:
             return
    -    _check_underlying_ops_loaded()
    -    try:
    -        torch.library.register_fake("deepmd_export::border_op")(_border_op_fake)
    -    except RuntimeError as e:
    -        if "already has" not in str(e) and "already registered" not in str(e):
    -            raise
    -    try:
    -        torch.library.register_fake("deepmd_export::border_op_backward")(
    -            _border_op_backward_fake
    -        )
    -    except RuntimeError as e:
    -        if "already has" not in str(e) and "already registered" not in str(e):
    -            raise
    -    torch.library.register_autograd(
    -        "deepmd_export::border_op",
    -        _border_op_backward,
    -        setup_context=_border_op_setup_context,
    -    )
    -    _registered = True
    +    with _register_lock:
    +        if _registered:
    +            return
    +        _check_underlying_ops_loaded()
    +        try:
    +            torch.library.register_fake("deepmd_export::border_op")(_border_op_fake)
    +        except RuntimeError as e:
    +            if "already has" not in str(e) and "already registered" not in str(e):
    +                raise
    +        try:
    +            torch.library.register_fake("deepmd_export::border_op_backward")(
    +                _border_op_backward_fake
    +            )
    +        except RuntimeError as e:
    +            if "already has" not in str(e) and "already registered" not in str(e):
    +                raise
    +        torch.library.register_autograd(
    +            "deepmd_export::border_op",
    +            _border_op_backward,
    +            setup_context=_border_op_setup_context,
    +        )
    +        _registered = True
📝 Committable suggestion
    global _registered
    if _registered:
        return
    with _register_lock:
        if _registered:
            return
        _check_underlying_ops_loaded()
        try:
            torch.library.register_fake("deepmd_export::border_op")(_border_op_fake)
        except RuntimeError as e:
            if "already has" not in str(e) and "already registered" not in str(e):
                raise
        try:
            torch.library.register_fake("deepmd_export::border_op_backward")(
                _border_op_backward_fake
            )
        except RuntimeError as e:
            if "already has" not in str(e) and "already registered" not in str(e):
                raise
        torch.library.register_autograd(
            "deepmd_export::border_op",
            _border_op_backward,
            setup_context=_border_op_setup_context,
        )
        _registered = True
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deepmd/pt_expt/utils/comm.py` around lines 192 - 213, ensure_comm_registered
currently checks global _registered without synchronization, allowing a race;
wrap the registration block in a module-level lock (e.g., _register_lock) inside
ensure_comm_registered, acquire the lock, re-check _registered, then perform the
fake registrations and torch.library.register_autograd calls (referencing
_border_op_fake, _border_op_backward_fake, _border_op_backward, and
_border_op_setup_context) and set _registered = True before releasing the lock
to make the lazy-registration atomic and idempotent.
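The double-checked locking pattern requested in this comment can be tried in isolation. The sketch below is only an illustration under stated assumptions: `ensure_registered`, `_do_registration`, `registration_count`, and `run_concurrently` are hypothetical names, and the counter stands in for the real `torch.library` registration calls, which are not invoked here.

```python
import threading

_registered = False
_register_lock = threading.Lock()
_calls = 0  # hypothetical counter; stands in for the real registration work


def _do_registration() -> None:
    """Placeholder for the fake/autograd registration calls (assumption)."""
    global _calls
    _calls += 1


def ensure_registered() -> None:
    """Run the registration at most once, even with concurrent callers."""
    global _registered
    if _registered:  # fast path: no lock once registration has completed
        return
    with _register_lock:
        if _registered:  # re-check: another thread may have finished first
            return
        _do_registration()
        _registered = True  # flipped while still holding the lock


def registration_count() -> int:
    """Return how many times the registration body actually ran."""
    return _calls


def run_concurrently(n_threads: int = 8) -> int:
    """Release n_threads callers at the same moment and report the count."""
    barrier = threading.Barrier(n_threads)

    def worker() -> None:
        barrier.wait()  # line the threads up to make a race as likely as possible
        ensure_registered()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return registration_count()
```

The unlocked first check keeps repeated calls cheap, while the second check inside the lock is what prevents two callers from both running the registration body.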
    if with_comm_dict:
        # Load libdeepmd_op_pt.so and register border_op fake/autograd
        # metadata now — deferred from import time so normal utils imports
        # don't force-load the op library and break fake-op ordering.
        from deepmd.pt_expt.utils.comm import ensure_comm_registered

        ensure_comm_registered()
        if not _needs_with_comm_artifact(model):
Reject non-comm models before loading the comm op library.
Lines 676-678 force libdeepmd_op_pt.so to load before the code checks whether this model even needs a comm artifact. That brings back the side effect this PR is trying to avoid on invalid with_comm_dict=True calls, and it can replace the intended ValueError with an unrelated op-loading failure. Please move the _needs_with_comm_artifact(model) check ahead of ensure_comm_registered().
Suggested fix
     if with_comm_dict:
    -    # Load libdeepmd_op_pt.so and register border_op fake/autograd
    -    # metadata now — deferred from import time so normal utils imports
    -    # don't force-load the op library and break fake-op ordering.
    -    from deepmd.pt_expt.utils.comm import ensure_comm_registered
    -
    -    ensure_comm_registered()
         if not _needs_with_comm_artifact(model):
             raise ValueError(
                 "with_comm_dict=True requested but the model's descriptor "
                 "does not need cross-rank message passing "
                 "(has_message_passing_across_ranks() is False) — "
                 "there's nothing to compile."
             )
    +    # Load libdeepmd_op_pt.so and register border_op fake/autograd
    +    # metadata only for models that actually need the comm path.
    +    from deepmd.pt_expt.utils.comm import ensure_comm_registered
    +
    +    ensure_comm_registered()
         nloc_sample = nlist_t.shape[1]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deepmd/pt_expt/utils/serialization.py` around lines 672 - 679, The code
currently imports and calls ensure_comm_registered() before checking whether the
model actually requires a comm artifact, which can prematurely load
libdeepmd_op_pt.so; move the _needs_with_comm_artifact(model) check before
importing/calling ensure_comm_registered so you reject non-comm models first.
Concretely, evaluate if not _needs_with_comm_artifact(model) and raise the
existing ValueError (or return) before executing the from
deepmd.pt_expt.utils.comm import ensure_comm_registered and
ensure_comm_registered() calls; keep the import/call only in the branch where
_needs_with_comm_artifact(model) is True.
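The ordering argument in this comment (validate cheaply first, load the op library second) can be sketched without PyTorch. Everything here is a stand-in chosen for illustration: `needs_comm`, `ensure_comm_loaded`, `export_model`, the dict-based model, and the `loaded_libraries` list are assumptions, not the project's actual API.

```python
loaded_libraries = []  # records the simulated side effect for inspection


def needs_comm(model: dict) -> bool:
    """Hypothetical stand-in for _needs_with_comm_artifact(model)."""
    return bool(model.get("message_passing_across_ranks", False))


def ensure_comm_loaded() -> None:
    """Hypothetical stand-in for the expensive op-library load."""
    loaded_libraries.append("libdeepmd_op_pt.so")


def export_model(model: dict, with_comm_dict: bool = False) -> str:
    if with_comm_dict:
        # Validate first, so an invalid request fails with the intended
        # ValueError and never triggers the library load.
        if not needs_comm(model):
            raise ValueError(
                "with_comm_dict=True requested but the model does not "
                "need cross-rank message passing"
            )
        # Only models that really take the comm path pay for the load.
        ensure_comm_loaded()
        return "exported with comm artifact"
    return "exported"
```

With this ordering, calling `export_model` on a model that does not need the comm path raises `ValueError` and leaves `loaded_libraries` empty, which is the behaviour the review asks for.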
        _border_op_backward,
        setup_context=_border_op_setup_context,
    )
    _registered = True
Pull request overview
This PR makes PyTorch custom-op (border_op) registration for pt_expt lazy, deferring libdeepmd_op_pt.so loading and the fake/autograd metadata registration until the comm-dict export path is actually exercised. The intent is to avoid import-time side effects that can cause fake-op registration order conflicts, especially in tests.
Changes:
- Introduces `ensure_comm_registered()` in `deepmd.pt_expt.utils.comm` to explicitly load the op library and register fake/autograd metadata on demand.
- Calls `ensure_comm_registered()` from the `with_comm_dict` export path in `serialization.py` and removes eager import of `comm` from `deepmd.pt_expt.utils.__init__`.
- Updates test conftest/comments to reflect the new lazy-loading behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| source/tests/pt_expt/conftest.py | Updates commentary to describe lazy comm/op loading behavior in tests. |
| deepmd/pt_expt/utils/serialization.py | Lazily imports and calls ensure_comm_registered() in the with_comm_dict export path. |
| deepmd/pt_expt/utils/comm.py | Adds explicit, idempotent ensure_comm_registered() and removes import-time registration side effects. |
| deepmd/pt_expt/utils/__init__.py | Stops importing comm at package import time; documents lazy registration approach. |
    if with_comm_dict:
        # Load libdeepmd_op_pt.so and register border_op fake/autograd
        # metadata now — deferred from import time so normal utils imports
        # don't force-load the op library and break fake-op ordering.
        from deepmd.pt_expt.utils.comm import (
            ensure_comm_registered,
        )

        ensure_comm_registered()
        if not _needs_with_comm_artifact(model):
            raise ValueError(

    def ensure_comm_registered() -> None:
        """Load libdeepmd_op_pt.so and register fake/autograd metadata for border_op.

        Idempotent — safe to call multiple times. Must be called before any
        ``make_fx`` / ``torch.export`` trace that passes through border_op (i.e.
        before the ``with_comm_dict=True`` export path in serialization.py).

        Kept lazy (not called at import time) so that merely importing
        ``deepmd.pt_expt.utils`` does not force-load libdeepmd_op_pt.so and
        disrupt fake-op registration order in tests that don't exercise the comm
        path at all.

    torch.library.register_autograd(
        "deepmd_export::border_op",
        _border_op_backward,
        setup_context=_border_op_setup_context,
    )

    # comm.py (border_op fake/autograd) is NOT imported here — its
    # ensure_comm_registered() is called lazily from the with_comm_dict
    # export path in serialization.py to avoid eager libdeepmd_op_pt.so
    # loading that breaks fake-op registration order in tests.
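The claim that importing the package no longer pulls in the comm module can be checked with `sys.modules`. The sketch below builds a throwaway package in a temporary directory to show the idea; the package name `lazy_demo_pkg`, its `export` function, and the helpers are all hypothetical and only mimic the layout described in the comments above.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path


def build_demo_package(root: Path, name: str = "lazy_demo_pkg") -> str:
    """Write a tiny package whose __init__ does not import its comm submodule."""
    pkg = root / name
    pkg.mkdir()
    (pkg / "__init__.py").write_text(textwrap.dedent('''
        def export(with_comm=False):
            if with_comm:
                from . import comm  # deferred import, only on the comm path
                comm.ensure_registered()
                return "comm"
            return "plain"
    '''))
    (pkg / "comm.py").write_text(textwrap.dedent('''
        _registered = False

        def ensure_registered():
            global _registered
            _registered = True
    '''))
    return name


def demo() -> tuple:
    """Return whether the comm submodule is loaded after each of three steps."""
    with tempfile.TemporaryDirectory() as tmp:
        name = build_demo_package(Path(tmp))
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            pkg = importlib.import_module(name)
            after_import = f"{name}.comm" in sys.modules
            pkg.export()  # plain path: must not touch comm
            after_plain = f"{name}.comm" in sys.modules
            pkg.export(with_comm=True)  # comm path: triggers the lazy import
            after_comm = f"{name}.comm" in sys.modules
        finally:
            # Undo the path and module-cache changes made for the demo.
            sys.path.remove(tmp)
            for key in [k for k in sys.modules if k == name or k.startswith(name + ".")]:
                del sys.modules[key]
    return (after_import, after_plain, after_comm)
```

A similar `sys.modules` assertion in the real test suite could guard against the eager import being reintroduced, though whether that fits the project's conftest setup is not something this sketch can confirm.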
Codecov Report
❌ Patch coverage is
Additional details and impacted files
    @@            Coverage Diff             @@
    ##           master    #5451      +/-   ##
    ==========================================
    - Coverage   82.48%   82.47%   -0.01%
    ==========================================
      Files         830      830
      Lines       88522    88535      +13
      Branches     4232     4233       +1
    ==========================================
    + Hits        73015    73018       +3
    - Misses      14220    14229       +9
    - Partials     1287     1288       +1