fix: try remove memory footprint by anyangml · Pull Request #5449 · deepmodeling/deepmd-kit

anyangml · 2026-05-20T10:20:26Z

Two issues hinder multitask DDP training in pt-expt compile mode:

1. the memory footprint scales with the number of tasks N;
1. lazy compilation causes NCCL timeouts.

Solutions:

1. clean up the memory footprint after compiling each task;
1. compile all tasks before training starts, then synchronize.
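
The second solution can be sketched as follows. This is a minimal illustration, not the code in this PR: `compile_all_tasks`, `compile_task`, and `barrier` are hypothetical names, and the assumption is that lazy compilation lets ranks start compiling at different steps, so some ranks wait in a collective long enough to time out. In a real DDP run the `barrier` argument would likely be `torch.distributed.barrier`.

```python
from typing import Callable, Dict, Hashable, Iterable


def compile_all_tasks(
    task_keys: Iterable[Hashable],
    compile_task: Callable[[Hashable], object],
    barrier: Callable[[], None] = lambda: None,
) -> Dict[Hashable, object]:
    """Compile every task eagerly, then synchronize once.

    All names here are hypothetical. `compile_task` stands in for whatever
    traces, compiles, and warms up one task's model.
    """
    compiled = {}
    for key in task_keys:
        # Pay the compilation cost now, on every rank, instead of at the
        # first training step that happens to sample this task.
        compiled[key] = compile_task(key)
    # Wait for all ranks to finish compiling before training starts.
    barrier()
    return compiled
```

Calling it with a recording callback shows the intended order: every task is compiled first, and the barrier runs exactly once at the end.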

Summary by CodeRabbit

Bug Fixes
- Reduced memory usage during model tracing/compilation by immediately releasing temporary trace and compiled artifacts and conditionally clearing GPU cache when CUDA is active.
- Improved multi-task compilation stability by cleaning up per-task intermediate data right after each task.
- Ensured logged and validation metrics are converted to Python scalars for accurate aggregation and reporting.

Copilot

Pull request overview

This PR attempts to reduce peak/retained memory during torch.compile setup in the PyTorch experimental trainer by explicitly releasing intermediate tensors after FX tracing and after per-task compilation setup (especially for multi-task scenarios).

Changes:

- Delete trace seed tensors after make_fx captures the graph in _trace_and_compile.
- Store the compiled module in a local variable, delete traced_lower, then return the compiled wrapper.
- In _compile_model, delete per-task intermediate tensors after installing the compiled wrapper to avoid accumulation across tasks, and call torch.cuda.empty_cache().
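
The per-task cleanup idea can be illustrated without PyTorch. The sketch below is an assumption-laden toy, not the trainer's code: `TraceArtifacts` and `compile_tasks` are hypothetical, and the "compiled" result is just a string. It records how many earlier artifacts are still alive when the next task starts tracing.

```python
import gc
import weakref


class TraceArtifacts:
    """Hypothetical stand-in for one task's trace inputs and traced graph."""

    def __init__(self, task_key):
        self.task_key = task_key
        self.payload = bytearray(1024)  # pretend this is a large tensor


def compile_tasks(task_keys, cleanup=True):
    """Return the installed wrappers and, per task, the number of older
    artifacts still alive at the moment tracing of that task begins."""
    live = weakref.WeakSet()  # tracks artifacts that are still referenced
    installed = {}
    live_before_trace = []
    for key in task_keys:
        live_before_trace.append(len(live))
        artifacts = TraceArtifacts(key)      # trace-time intermediates
        live.add(artifacts)
        installed[key] = f"compiled:{key}"   # keep only the compiled result
        if cleanup:
            del artifacts                    # drop the reference right away
            gc.collect()
    return installed, live_before_trace
```

Without the explicit `del`, the previous task's artifacts stay bound to the loop variable while the next task is traced, so the peak holds two sets of intermediates instead of one. Whether the real intermediates are freed also depends on nothing else (for example the compiled wrapper) keeping a reference to them.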

coderabbitai · 2026-05-20T10:23:27Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Walkthrough

Deletes temporary tensors created during torch.fx tracing and per-task compilation, warms up lazily compiled modules, conditionally clears CUDA cache when CUDA is active, and converts training/validation metric tensors to Python scalars for logging/aggregation.

Changes

Memory management in model compilation

| Layer / File(s) | Summary |
| --- | --- |
| Single-model trace and compile cleanup `deepmd/pt_expt/train/training.py` | After `make_fx` produces the FX graph, trace-time inputs (`ext_coord`, `ext_atype`, `nlist`, `mapping`, optional `fparam`/`aparam`) are deleted; `torch.compile` result is stored in `compiled`, `traced_lower` is deleted, and `torch.cuda.empty_cache()` is called only when model params/buffers are on CUDA and CUDA is initialized. |
| Per-task compilation warmup and cleanup `deepmd/pt_expt/train/training.py` | Before wrapping the model in `_CompiledModel`, the lazily-compiled `compiled_lower` is warmed up with the task's sample inputs (warmup output deleted), optional `torch.cuda.synchronize()` is run for CUDA, `_CompiledModel` is installed into `wrapper_mod.model[task_key]`, and per-task tracing/compilation intermediates are deleted; conditional CUDA cache clearing follows and a distributed barrier is retained. |
| Single-task logging and validation scalarization `deepmd/pt_expt/train/training.py` | Training logging converts `more_loss` values to Python scalars via `.item()` (excluding keys starting with `l2_`); validation accumulation converts per-batch `more_loss` tensors to scalars before summing into `valid_results`. |
| Multi-task logging, forward-release, and validation scalarization `deepmd/pt_expt/train/training.py` | Multi-task training scalarizes `more_loss` for current and other tasks with `.item()`; after computing other tasks' contributions, local intermediates (`_loss`, `_more`, inputs/labels) are deleted and CUDA cache cleared conditionally. Multi-task validation aggregates per-batch metrics using `.item()` to accumulate scalars per task. |
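
The scalarization described in the last two rows might look roughly like the sketch below. `scalarize_metrics` and `accumulate` are hypothetical helpers, and dropping the `l2_` keys is only one reading of "excluding"; the trainer may instead leave them unconverted. Any object with an `.item()` method (a 0-dim `torch.Tensor`, a NumPy scalar) is treated as a tensor here.

```python
def scalarize_metrics(more_loss, skip_prefix="l2_"):
    """Convert metric values to plain Python scalars for logging.

    Hypothetical helper: keys starting with `skip_prefix` are dropped,
    which is an assumption about what "excluding" means.
    """
    out = {}
    for key, value in more_loss.items():
        if key.startswith(skip_prefix):
            continue
        # .item() copies a single value out of the tensor, so the logged
        # dict no longer keeps the tensor (or its graph) alive.
        out[key] = value.item() if hasattr(value, "item") else value
    return out


def accumulate(valid_results, batch_metrics):
    """Sum per-batch scalar metrics into a running dict (in place)."""
    for key, value in scalarize_metrics(batch_metrics).items():
        valid_results[key] = valid_results.get(key, 0.0) + value
    return valid_results
```

Storing scalars rather than tensors means the aggregation dictionaries hold only Python floats, which is presumably why the change helps with retained memory as well as with reporting.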

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'fix: try remove memory footprint' is vague and uses non-descriptive language ('try remove') that doesn't clearly convey what specific memory optimization was implemented. | Make the title more specific about the memory optimization, such as 'fix: reduce memory footprint by cleaning up intermediate tensors' or 'fix: add tensor cleanup and cache clearing in compilation and training loop'. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

codecov · 2026-05-20T11:17:27Z

Codecov Report

❌ Patch coverage is 91.93548% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.48%. Comparing base (d3f08f3) to head (c05542a).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| deepmd/pt_expt/train/training.py | 91.93% | 5 Missing ⚠️ |
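
As a quick sanity check on the patch figures: 5 missing lines at 91.93548% is consistent with a patch of 62 coverable lines, 57 of them hit. The 62 is inferred from the two reported numbers, not stated in the report, and `patch_coverage` is a hypothetical helper.

```python
def patch_coverage(total_lines: int, missing_lines: int) -> float:
    """Percentage of coverable patch lines that were executed by tests."""
    return 100.0 * (total_lines - missing_lines) / total_lines


# 62 is an inferred total: 5 / (1 - 0.9193548) is almost exactly 62.
inferred_total = round(5 / (1 - 0.9193548))
coverage = patch_coverage(inferred_total, 5)
```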

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #5449   +/-   ##
=======================================
  Coverage   82.48%   82.48%           
=======================================
  Files         830      830           
  Lines       88522    88581   +59     
  Branches     4232     4232           
=======================================
+ Hits        73015    73070   +55     
- Misses      14220    14226    +6     
+ Partials     1287     1285    -2

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Anyang Peng <137014849+anyangml@users.noreply.github.com>

for more information, see https://pre-commit.ci

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

coderabbitai

🧹 Nitpick comments (1)

deepmd/pt_expt/train/training.py (1)
1023-1024: ⚡ Quick win

Add torch.cuda.is_available() check for consistency and defensive programming.

The CUDA cache cleanup at line 1023 checks DEVICE.type == "cuda" and torch.cuda.is_initialized(), but the similar logic in _trace_and_compile (line 336) includes an additional torch.cuda.is_available() check. While functionally safe (if CUDA is unavailable, is_initialized() returns False), adding is_available() is an idiomatic PyTorch defensive pattern and improves consistency across the codebase.
♻️ Suggested improvement for consistency
-            if DEVICE.type == "cuda" and torch.cuda.is_initialized():
+            if DEVICE.type == "cuda" and torch.cuda.is_available() and torch.cuda.is_initialized():
                 torch.cuda.empty_cache()
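
A small helper is one way to keep the guard identical at both call sites. This is a sketch, not code from the PR: `maybe_empty_cuda_cache` is a hypothetical name, and the optional import only exists so the snippet can be loaded on machines without PyTorch. The three `torch.cuda` calls are the ones named in the review comment.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable without PyTorch
    torch = None


def maybe_empty_cuda_cache(device_type: str) -> bool:
    """Release cached CUDA blocks only when that is safe and meaningful.

    Returns True if the cache was cleared, False if the call was skipped.
    """
    if torch is None or device_type != "cuda":
        return False
    # is_available() guards against CPU-only builds and missing drivers;
    # is_initialized() avoids creating a CUDA context just to clear a cache.
    if not (torch.cuda.is_available() and torch.cuda.is_initialized()):
        return False
    torch.cuda.empty_cache()
    return True
```

A call site would then read `maybe_empty_cuda_cache(DEVICE.type)`, which removes the chance of the two conditions drifting apart again.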
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/pt_expt/train/training.py` around lines 1023 - 1024, Update the CUDA
cache cleanup conditional to include torch.cuda.is_available() for defensive
consistency: where you currently check DEVICE.type == "cuda" and
torch.cuda.is_initialized() (in training.py), add torch.cuda.is_available() to
the condition so it mirrors the check used in _trace_and_compile and guards
against unavailable CUDA devices; modify the condition that references DEVICE
and torch.cuda.is_initialized() to include torch.cuda.is_available().

📥 Commits

Reviewing files that changed from the base of the PR and between c7d9f57 and 2a10532.

📒 Files selected for processing (1)

deepmd/pt_expt/train/training.py

fix: try remove memory footprint

7714b84

Copilot AI review requested due to automatic review settings May 20, 2026 10:20

dosubot Bot added the bug label May 20, 2026

github-actions Bot added the Python label May 20, 2026

Copilot started reviewing on behalf of anyangml May 20, 2026 10:20 View session

Copilot AI reviewed May 20, 2026

Comment thread deepmd/pt_expt/train/training.py Outdated

Comment thread deepmd/pt_expt/train/training.py Outdated

anyangml and others added 3 commits May 20, 2026 19:33

Potential fix for pull request finding

ee413c3

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Anyang Peng <137014849+anyangml@users.noreply.github.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

eb239ef

for more information, see https://pre-commit.ci

fix: comment

c7d9f57

anyangml requested a review from Copilot May 20, 2026 11:39

Copilot started reviewing on behalf of anyangml May 20, 2026 11:39 View session

anyangml requested review from iProzd and wanghan-iapcm May 20, 2026 11:40

Copilot AI reviewed May 20, 2026

Comment thread deepmd/pt_expt/train/training.py Outdated

Comment thread deepmd/pt_expt/train/training.py Outdated

Potential fix for pull request finding

2a10532

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Anyang Peng <137014849+anyangml@users.noreply.github.com>

coderabbitai Bot reviewed May 20, 2026

anyangml and others added 8 commits May 21, 2026 10:47

fix: remove graph

8b7584a

[pre-commit.ci] auto fixes from pre-commit.com hooks

87e4e46

for more information, see https://pre-commit.ci

fix: lazy compile in multitask NCCL timeout

4ffc15a

fix: mark variable-size dimensions as dynamic to prevent NCCL timeout

36d57a7

[pre-commit.ci] auto fixes from pre-commit.com hooks

1978e33

for more information, see https://pre-commit.ci

fix: mark tensors as dynamic to prevent NCCL timeout during training

937b742

[pre-commit.ci] auto fixes from pre-commit.com hooks

2ee8af1

for more information, see https://pre-commit.ci

fix: dynamic shape

c05542a

fix: try remove memory footprint #5449
anyangml wants to merge 13 commits into deepmodeling:master from anyangml:fix/compile-multitask-oom

2 participants
