
fix: fake border op import #5451

Open
anyangml wants to merge 2 commits into deepmodeling:master from anyangml:fix/border-op-import

@anyangml (Collaborator) commented May 21, 2026

Summary by CodeRabbit

  • Refactor
    • Deferred internal operation registration from import/initialization time to first use, preventing registration-order conflicts and improving test initialization reliability.
  • Documentation
    • Clarified module and test docs to note the lazy registration behavior and when the operation library is loaded.


Copilot AI review requested due to automatic review settings May 21, 2026 09:18
@dosubot dosubot Bot added the bug label May 21, 2026
@coderabbitai (Bot, Contributor) commented May 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 4c88fe65-df19-40ed-9137-fe94c160d5a1

📥 Commits

Reviewing files that changed from the base of the PR and between 824a5a3 and b64c625.

📒 Files selected for processing (1)
  • deepmd/pt_expt/utils/serialization.py

📝 Walkthrough

This PR defers registration of deepmd_export::border_op from import-time side effects to lazy initialization at trace time. A new ensure_comm_registered() function is added to comm.py and invoked in serialization.py when with_comm_dict=True, avoiding operator registration order issues in tests.

Changes

Deferred comm operator registration

| Layer / File(s) | Summary |
| --- | --- |
| Lazy registration mechanism (deepmd/pt_expt/utils/comm.py) | Module-level _registered flag and ensure_comm_registered() function implement idempotent lazy registration of border_op and border_op_backward fakes and autograd. Import-time _check_underlying_ops_loaded() call and decorator-based registration are removed. |
| Serialization export integration (deepmd/pt_expt/utils/serialization.py) | _trace_and_export calls ensure_comm_registered() when with_comm_dict=True to trigger operator registration before comm artifact validation and sample input construction. |
| Module and test documentation updates (deepmd/pt_expt/utils/__init__.py, source/tests/pt_expt/conftest.py) | Comments updated to reflect lazy loading semantics: operator registration is deferred to the with_comm_dict export path via ensure_comm_registered() rather than occurring at module import. |

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: fake border op import' is specific and directly related to the main change: deferring registration of border_op fake implementations from import-time to lazy initialization via ensure_comm_registered(). |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai (Bot, Contributor) left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deepmd/pt_expt/utils/comm.py`:
- Around line 192-213: ensure_comm_registered currently checks global
_registered without synchronization, allowing a race; wrap the registration
block in a module-level lock (e.g., _register_lock) inside
ensure_comm_registered, acquire the lock, re-check _registered, then perform the
fake registrations and torch.library.register_autograd calls (referencing
_border_op_fake, _border_op_backward_fake, _border_op_backward, and
_border_op_setup_context) and set _registered = True before releasing the lock
to make the lazy-registration atomic and idempotent.

In `@deepmd/pt_expt/utils/serialization.py`:
- Around line 672-679: The code currently imports and calls
ensure_comm_registered() before checking whether the model actually requires a
comm artifact, which can prematurely load libdeepmd_op_pt.so; move the
_needs_with_comm_artifact(model) check before importing/calling
ensure_comm_registered so you reject non-comm models first. Concretely, evaluate
if not _needs_with_comm_artifact(model) and raise the existing ValueError (or
return) before executing the from deepmd.pt_expt.utils.comm import
ensure_comm_registered and ensure_comm_registered() calls; keep the import/call
only in the branch where _needs_with_comm_artifact(model) is True.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: dd6a77be-24bf-427d-b826-ddd0d908fb81

📥 Commits

Reviewing files that changed from the base of the PR and between d3f08f3 and 824a5a3.

📒 Files selected for processing (4)
  • deepmd/pt_expt/utils/__init__.py
  • deepmd/pt_expt/utils/comm.py
  • deepmd/pt_expt/utils/serialization.py
  • source/tests/pt_expt/conftest.py

Comment on lines +192 to +213
    global _registered
    if _registered:
        return
    _check_underlying_ops_loaded()
    try:
        torch.library.register_fake("deepmd_export::border_op")(_border_op_fake)
    except RuntimeError as e:
        if "already has" not in str(e) and "already registered" not in str(e):
            raise
    try:
        torch.library.register_fake("deepmd_export::border_op_backward")(
            _border_op_backward_fake
        )
    except RuntimeError as e:
        if "already has" not in str(e) and "already registered" not in str(e):
            raise
    torch.library.register_autograd(
        "deepmd_export::border_op",
        _border_op_backward,
        setup_context=_border_op_setup_context,
    )
    _registered = True

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make the lazy-registration guard atomic.

ensure_comm_registered() is documented as idempotent, but two callers can both pass Line 193 before _registered flips and then race through the global registration path. Please protect the block with a module-level lock and re-check _registered inside it.

Suggested fix
+import threading
+
 import torch
 
 _registered: bool = False
+_register_lock = threading.Lock()
 ...
 def ensure_comm_registered() -> None:
     ...
     global _registered
     if _registered:
         return
-    _check_underlying_ops_loaded()
-    try:
-        torch.library.register_fake("deepmd_export::border_op")(_border_op_fake)
-    except RuntimeError as e:
-        if "already has" not in str(e) and "already registered" not in str(e):
-            raise
-    try:
-        torch.library.register_fake("deepmd_export::border_op_backward")(
-            _border_op_backward_fake
-        )
-    except RuntimeError as e:
-        if "already has" not in str(e) and "already registered" not in str(e):
-            raise
-    torch.library.register_autograd(
-        "deepmd_export::border_op",
-        _border_op_backward,
-        setup_context=_border_op_setup_context,
-    )
-    _registered = True
+    with _register_lock:
+        if _registered:
+            return
+        _check_underlying_ops_loaded()
+        try:
+            torch.library.register_fake("deepmd_export::border_op")(_border_op_fake)
+        except RuntimeError as e:
+            if "already has" not in str(e) and "already registered" not in str(e):
+                raise
+        try:
+            torch.library.register_fake("deepmd_export::border_op_backward")(
+                _border_op_backward_fake
+            )
+        except RuntimeError as e:
+            if "already has" not in str(e) and "already registered" not in str(e):
+                raise
+        torch.library.register_autograd(
+            "deepmd_export::border_op",
+            _border_op_backward,
+            setup_context=_border_op_setup_context,
+        )
+        _registered = True

Comment on lines 672 to 679
    if with_comm_dict:
        # Load libdeepmd_op_pt.so and register border_op fake/autograd
        # metadata now — deferred from import time so normal utils imports
        # don't force-load the op library and break fake-op ordering.
        from deepmd.pt_expt.utils.comm import ensure_comm_registered

        ensure_comm_registered()
        if not _needs_with_comm_artifact(model):

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject non-comm models before loading the comm op library.

Lines 676-678 force libdeepmd_op_pt.so to load before the code checks whether this model even needs a comm artifact. That brings back the side effect this PR is trying to avoid on invalid with_comm_dict=True calls, and it can replace the intended ValueError with an unrelated op-loading failure. Please move the _needs_with_comm_artifact(model) check ahead of ensure_comm_registered().

Suggested fix
     if with_comm_dict:
-        # Load libdeepmd_op_pt.so and register border_op fake/autograd
-        # metadata now — deferred from import time so normal utils imports
-        # don't force-load the op library and break fake-op ordering.
-        from deepmd.pt_expt.utils.comm import ensure_comm_registered
-
-        ensure_comm_registered()
         if not _needs_with_comm_artifact(model):
             raise ValueError(
                 "with_comm_dict=True requested but the model's descriptor "
                 "does not need cross-rank message passing "
                 "(has_message_passing_across_ranks() is False) — "
                 "there's nothing to compile."
             )
+        # Load libdeepmd_op_pt.so and register border_op fake/autograd
+        # metadata only for models that actually need the comm path.
+        from deepmd.pt_expt.utils.comm import ensure_comm_registered
+
+        ensure_comm_registered()
         nloc_sample = nlist_t.shape[1]
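The reordering asked for here (validate first, load afterwards) can be checked with a small stand-alone sketch. All names are hypothetical stand-ins: `needs_comm_artifact` plays the role of `_needs_with_comm_artifact`, `ensure_registered` plays the role of `ensure_comm_registered`, and a plain dict stands in for the model.

```python
load_log = []  # records whether the stand-in loader ran


def needs_comm_artifact(model):
    """Hypothetical stand-in for _needs_with_comm_artifact(model)."""
    return bool(model.get("message_passing"))


def ensure_registered():
    """Hypothetical stand-in for ensure_comm_registered()."""
    load_log.append("loaded")


def trace_and_export(model, with_comm_dict=False):
    if with_comm_dict:
        # Cheap validation first: reject models with nothing to compile.
        if not needs_comm_artifact(model):
            raise ValueError("with_comm_dict=True requested but not needed")
        # Only now pay for the side-effecting library load.
        ensure_registered()
    return "exported"


# An invalid request is rejected before anything is loaded.
try:
    trace_and_export({"message_passing": False}, with_comm_dict=True)
except ValueError:
    rejected = True
else:
    rejected = False
loaded_after_rejection = list(load_log)

# A valid request triggers the load exactly once.
result = trace_and_export({"message_passing": True}, with_comm_dict=True)
```

With this ordering the caller of an invalid export sees the intended ValueError, and the loader never runs for that call.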


Copilot AI left a comment


Pull request overview

This PR makes PyTorch custom-op (border_op) registration for pt_expt lazy, deferring libdeepmd_op_pt.so loading and the fake/autograd metadata registration until the comm-dict export path is actually exercised. The intent is to avoid import-time side effects that can cause fake-op registration order conflicts, especially in tests.

Changes:

  • Introduces ensure_comm_registered() in deepmd.pt_expt.utils.comm to explicitly load the op library and register fake/autograd metadata on demand.
  • Calls ensure_comm_registered() from the with_comm_dict export path in serialization.py and removes eager import of comm from deepmd.pt_expt.utils.__init__.
  • Updates test conftest/comments to reflect the new lazy-loading behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| source/tests/pt_expt/conftest.py | Updates commentary to describe lazy comm/op loading behavior in tests. |
| deepmd/pt_expt/utils/serialization.py | Lazily imports and calls ensure_comm_registered() in the with_comm_dict export path. |
| deepmd/pt_expt/utils/comm.py | Adds explicit, idempotent ensure_comm_registered() and removes import-time registration side effects. |
| deepmd/pt_expt/utils/__init__.py | Stops importing comm at package import time; documents lazy registration approach. |


Comment on lines 672 to 682
    if with_comm_dict:
        # Load libdeepmd_op_pt.so and register border_op fake/autograd
        # metadata now — deferred from import time so normal utils imports
        # don't force-load the op library and break fake-op ordering.
        from deepmd.pt_expt.utils.comm import (
            ensure_comm_registered,
        )

        ensure_comm_registered()
        if not _needs_with_comm_artifact(model):
            raise ValueError(
Comment on lines +180 to +190
def ensure_comm_registered() -> None:
    """Load libdeepmd_op_pt.so and register fake/autograd metadata for border_op.

    Idempotent — safe to call multiple times. Must be called before any
    ``make_fx`` / ``torch.export`` trace that passes through border_op (i.e.
    before the ``with_comm_dict=True`` export path in serialization.py).

    Kept lazy (not called at import time) so that merely importing
    ``deepmd.pt_expt.utils`` does not force-load libdeepmd_op_pt.so and
    disrupt fake-op registration order in tests that don't exercise the comm
    path at all.
Comment on lines +208 to +212
    torch.library.register_autograd(
        "deepmd_export::border_op",
        _border_op_backward,
        setup_context=_border_op_setup_context,
    )
Comment on lines +26 to +29
# comm.py (border_op fake/autograd) is NOT imported here — its
# ensure_comm_registered() is called lazily from the with_comm_dict
# export path in serialization.py to avoid eager libdeepmd_op_pt.so
# loading that breaks fake-op registration order in tests.
@codecov (Bot) commented May 21, 2026

Codecov Report

❌ Patch coverage is 63.15789% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.47%. Comparing base (d3f08f3) to head (b64c625).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| deepmd/pt_expt/utils/comm.py | 58.82% | 7 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5451      +/-   ##
==========================================
- Coverage   82.48%   82.47%   -0.01%     
==========================================
  Files         830      830              
  Lines       88522    88535      +13     
  Branches     4232     4233       +1     
==========================================
+ Hits        73015    73018       +3     
- Misses      14220    14229       +9     
- Partials     1287     1288       +1     
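The percentages in this report can be reproduced from the counts in the table. The project figures use the Hits and Lines rows directly. The patch denominators are not stated in the report and are inferred here: 7 missing lines at 63.15789% is consistent with 19 changed lines, and 7 missing at 58.82% is consistent with 17 lines in comm.py.

```python
def coverage_pct(hits, total):
    """Coverage as a percentage of hit lines over total lines."""
    return 100.0 * hits / total


# Project totals taken from the Hits and Lines rows of the coverage diff.
base = coverage_pct(73015, 88522)
head = coverage_pct(73018, 88535)

# Inferred, not stated in the report: 19 changed lines, 7 of them missing.
patch = coverage_pct(19 - 7, 19)
# Inferred the same way for comm.py: 17 lines, 7 of them missing.
comm = coverage_pct(17 - 7, 17)

print(f"base={base:.2f}% head={head:.2f}% delta={head - base:+.2f}%")
print(f"patch={patch:.5f}% comm.py={comm:.2f}%")
```

The small project-level drop follows from the counts: 13 lines were added to the total but only 3 more were hit.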
