fix: try remove memory footprint #5449

Open
anyangml wants to merge 13 commits into deepmodeling:master from anyangml:fix/compile-multitask-oom

Conversation

@anyangml

@anyangml anyangml commented May 20, 2026

There are two issues hindering multitask DDP training in pt-expt compile mode:

    1. The memory footprint scales with the number of tasks (N).
    2. Lazy compilation causes NCCL timeouts.

Solutions:

    1. Clean up the memory footprint after compiling each task.
    2. Compile all tasks before training starts, then sync.
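
The two solutions can be pictured with the minimal sketch below. It is not the code in this PR: `compile_all_tasks` and its callable parameters (`build_inputs`, `trace_and_compile`, `install`, `empty_cache`, `barrier`) are hypothetical stand-ins chosen so the control flow can be shown without PyTorch. In a real DDP run, `barrier` would presumably be something like `torch.distributed.barrier` and `empty_cache` something like `torch.cuda.empty_cache`.

```python
import gc


def compile_all_tasks(task_keys, build_inputs, trace_and_compile, install,
                      empty_cache=None, barrier=None):
    """Eagerly compile every task, freeing per-task intermediates as we go.

    All callables are hypothetical stand-ins (assumptions, not the PR's API).
    """
    for key in task_keys:
        inputs = build_inputs(key)  # sample inputs for this task
        compiled = trace_and_compile(key, inputs)

        # Warm-up call: forces a lazily compiled module to compile now,
        # before training starts.
        warmup_out = compiled(*inputs)
        del warmup_out

        install(key, compiled)  # e.g. store the compiled wrapper on the model

        # Drop per-task intermediates so they do not accumulate across tasks.
        del inputs, compiled
        gc.collect()
        if empty_cache is not None:
            empty_cache()

    # Ranks meet here only after each has compiled all of its tasks.
    if barrier is not None:
        barrier()
```

The design choice being illustrated is that all compilation work happens before the first training step, so no rank is left waiting inside a collective while another rank compiles a task for the first time.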

Summary by CodeRabbit

  • Bug Fixes
    • Reduced memory usage during model tracing/compilation by immediately releasing temporary trace and compiled artifacts and conditionally clearing GPU cache when CUDA is active.
    • Improved multi-task compilation stability by cleaning up per-task intermediate data right after each task.
    • Ensured logged and validation metrics are converted to Python scalars for accurate aggregation and reporting.

Copilot AI review requested due to automatic review settings May 20, 2026 10:20
@dosubot dosubot Bot added the bug label May 20, 2026

Copilot AI left a comment

Pull request overview

This PR attempts to reduce peak/retained memory during torch.compile setup in the PyTorch experimental trainer by explicitly releasing intermediate tensors after FX tracing and after per-task compilation setup (especially for multi-task scenarios).

Changes:

  • Delete trace seed tensors after make_fx captures the graph in _trace_and_compile.
  • Store the compiled module in a local variable, delete traced_lower, then return the compiled wrapper.
  • In _compile_model, delete per-task intermediate tensors after installing the compiled wrapper to avoid accumulation across tasks, and call torch.cuda.empty_cache().
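
The ordering described in the first two bullets can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `trace_then_compile`, `make_inputs`, `tracer`, and `compiler` are hypothetical names standing in for the seed-tensor builder, `make_fx`, and `torch.compile`.

```python
def trace_then_compile(make_inputs, tracer, compiler):
    """Trace with seed inputs, then compile, releasing each intermediate early.

    `make_inputs`, `tracer`, and `compiler` are hypothetical callables
    (assumptions made for this sketch).
    """
    inputs = make_inputs()
    traced = tracer(*inputs)

    # The graph has been captured, so the seed inputs are no longer needed.
    # Drop the last local reference before the compile step runs.
    del inputs

    compiled = compiler(traced)

    # Drop the local name for the traced graph. It is only actually freed
    # if the compiled object does not keep its own reference to it.
    del traced
    return compiled
```

Deleting the local names matters for peak memory rather than final memory: without the `del`, the seed inputs would stay alive for the whole duration of the compile step and only be released when the function returns.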

Comment thread deepmd/pt_expt/train/training.py Outdated
Comment thread deepmd/pt_expt/train/training.py Outdated
@coderabbitai

coderabbitai Bot commented May 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.


Walkthrough

Deletes temporary tensors created during torch.fx tracing and per-task compilation, warms up lazily compiled modules, conditionally clears CUDA cache when CUDA is active, and converts training/validation metric tensors to Python scalars for logging/aggregation.

Changes

Memory management in model compilation

| Layer / File(s) | Summary |
|---|---|
| Single-model trace and compile cleanup (deepmd/pt_expt/train/training.py) | After make_fx produces the FX graph, trace-time inputs (ext_coord, ext_atype, nlist, mapping, optional fparam/aparam) are deleted; torch.compile result is stored in compiled, traced_lower is deleted, and torch.cuda.empty_cache() is called only when model params/buffers are on CUDA and CUDA is initialized. |
| Per-task compilation warmup and cleanup (deepmd/pt_expt/train/training.py) | Before wrapping the model in _CompiledModel, the lazily-compiled compiled_lower is warmed up with the task's sample inputs (warmup output deleted), optional torch.cuda.synchronize() is run for CUDA, _CompiledModel is installed into wrapper_mod.model[task_key], and per-task tracing/compilation intermediates are deleted; conditional CUDA cache clearing follows and a distributed barrier is retained. |
| Single-task logging and validation scalarization (deepmd/pt_expt/train/training.py) | Training logging converts more_loss values to Python scalars via .item() (excluding keys starting with l2_); validation accumulation converts per-batch more_loss tensors to scalars before summing into valid_results. |
| Multi-task logging, forward-release, and validation scalarization (deepmd/pt_expt/train/training.py) | Multi-task training scalarizes more_loss for current and other tasks with .item(); after computing other tasks' contributions local intermediates (_loss, _more, inputs/labels) are deleted and CUDA cache cleared conditionally. Multi-task validation aggregates per-batch metrics using .item() to accumulate scalars per task. |
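
The scalarization rows above can be illustrated with a small sketch. The helper names (`to_scalar`, `scalarize_for_logging`, `accumulate_validation`) are hypothetical and are not taken from `training.py`; the only thing assumed about the metric values is that tensors expose `.item()`, as 0-d PyTorch tensors do.

```python
def to_scalar(value):
    """Convert a 0-d tensor-like object (anything with .item()) to a float."""
    item = getattr(value, "item", None)
    return float(item()) if callable(item) else float(value)


def scalarize_for_logging(more_loss):
    """Return plain-float metrics, skipping keys that start with 'l2_'."""
    return {
        key: to_scalar(value)
        for key, value in more_loss.items()
        if not key.startswith("l2_")
    }


def accumulate_validation(valid_results, more_loss):
    """Add one batch of metrics into running float sums (hypothetical helper)."""
    for key, value in more_loss.items():
        valid_results[key] = valid_results.get(key, 0.0) + to_scalar(value)
    return valid_results
```

Storing floats instead of tensors means the logging and validation dictionaries no longer hold references to device tensors, which could otherwise keep their memory (and, for tensors that require grad, the attached autograd graph) alive between steps.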

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'fix: try remove memory footprint' is vague and uses non-descriptive language ('try remove') that doesn't clearly convey what specific memory optimization was implemented. | Make the title more specific about the memory optimization, such as 'fix: reduce memory footprint by cleaning up intermediate tensors' or 'fix: add tensor cleanup and cache clearing in compilation and training loop'. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@codecov

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 91.93548% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.48%. Comparing base (d3f08f3) to head (c05542a).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| deepmd/pt_expt/train/training.py | 91.93% | 5 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5449   +/-   ##
=======================================
  Coverage   82.48%   82.48%           
=======================================
  Files         830      830           
  Lines       88522    88581   +59     
  Branches     4232     4232           
=======================================
+ Hits        73015    73070   +55     
- Misses      14220    14226    +6     
+ Partials     1287     1285    -2     

anyangml and others added 3 commits May 20, 2026 19:33
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Anyang Peng <137014849+anyangml@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread deepmd/pt_expt/train/training.py Outdated
Comment thread deepmd/pt_expt/train/training.py Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Anyang Peng <137014849+anyangml@users.noreply.github.com>

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
deepmd/pt_expt/train/training.py (1)

1023-1024: ⚡ Quick win

Add torch.cuda.is_available() check for consistency and defensive programming.

The CUDA cache cleanup at line 1023 checks DEVICE.type == "cuda" and torch.cuda.is_initialized(), but the similar logic in _trace_and_compile (line 336) includes an additional torch.cuda.is_available() check. While functionally safe (if CUDA is unavailable, is_initialized() returns False), adding is_available() is an idiomatic PyTorch defensive pattern and improves consistency across the codebase.

♻️ Suggested improvement for consistency
-            if DEVICE.type == "cuda" and torch.cuda.is_initialized():
+            if DEVICE.type == "cuda" and torch.cuda.is_available() and torch.cuda.is_initialized():
                 torch.cuda.empty_cache()
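
One way to keep the two call sites consistent would be to route both through a single guard. The sketch below is only a suggestion under stated assumptions: `maybe_empty_cuda_cache` is a hypothetical helper that does not exist in the codebase, and the `cuda` argument is expected to behave like the `torch.cuda` module (it only relies on `is_available()`, `is_initialized()`, and `empty_cache()`).

```python
def maybe_empty_cuda_cache(device_type, cuda):
    """Call cuda.empty_cache() only when CUDA is in use and initialized.

    `cuda` is expected to behave like the `torch.cuda` module; passing it in
    (an assumption made for this sketch) keeps the helper testable without a GPU.
    Returns True if the cache was cleared, False otherwise.
    """
    if device_type == "cuda" and cuda.is_available() and cuda.is_initialized():
        cuda.empty_cache()
        return True
    return False
```

In the trainer this would be called as `maybe_empty_cuda_cache(DEVICE.type, torch.cuda)`, assuming `DEVICE` is the module-level `torch.device` referenced in the condition above.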
ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between c7d9f57 and 2a10532.

📒 Files selected for processing (1)
  • deepmd/pt_expt/train/training.py
