fix: try remove memory footprint#5449
Conversation
There was a problem hiding this comment.
Pull request overview
This PR attempts to reduce peak/retained memory during torch.compile setup in the PyTorch experimental trainer by explicitly releasing intermediate tensors after FX tracing and after per-task compilation setup (especially for multi-task scenarios).
Changes:
- Delete trace seed tensors after
make_fxcaptures the graph in_trace_and_compile. - Store the compiled module in a local variable, delete
traced_lower, then return the compiled wrapper. - In
_compile_model, delete per-task intermediate tensors after installing the compiled wrapper to avoid accumulation across tasks, and calltorch.cuda.empty_cache().
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughDeletes temporary tensors created during torch.fx tracing and per-task compilation, warms up lazily compiled modules, conditionally clears CUDA cache when CUDA is active, and converts training/validation metric tensors to Python scalars for logging/aggregation. ChangesMemory management in model compilation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5449 +/- ##
=======================================
Coverage 82.48% 82.48%
=======================================
Files 830 830
Lines 88522 88581 +59
Branches 4232 4232
=======================================
+ Hits 73015 73070 +55
- Misses 14220 14226 +6
+ Partials 1287 1285 -2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Anyang Peng <137014849+anyangml@users.noreply.github.com>
for more information, see https://pre-commit.ci
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Anyang Peng <137014849+anyangml@users.noreply.github.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
deepmd/pt_expt/train/training.py (1)
1023-1024: ⚡ Quick winAdd
torch.cuda.is_available()check for consistency and defensive programming.The CUDA cache cleanup at line 1023 checks
DEVICE.type == "cuda" and torch.cuda.is_initialized(), but the similar logic in_trace_and_compile(line 336) includes an additionaltorch.cuda.is_available()check. While functionally safe (if CUDA is unavailable,is_initialized()returns False), addingis_available()is an idiomatic PyTorch defensive pattern and improves consistency across the codebase.♻️ Suggested improvement for consistency
- if DEVICE.type == "cuda" and torch.cuda.is_initialized(): + if DEVICE.type == "cuda" and torch.cuda.is_available() and torch.cuda.is_initialized(): torch.cuda.empty_cache()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt_expt/train/training.py` around lines 1023 - 1024, Update the CUDA cache cleanup conditional to include torch.cuda.is_available() for defensive consistency: where you currently check DEVICE.type == "cuda" and torch.cuda.is_initialized() (in training.py), add torch.cuda.is_available() to the condition so it mirrors the check used in _trace_and_compile and guards against unavailable CUDA devices; modify the condition that references DEVICE and torch.cuda.is_initialized() to include torch.cuda.is_available().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@deepmd/pt_expt/train/training.py`:
- Around line 1023-1024: Update the CUDA cache cleanup conditional to include
torch.cuda.is_available() for defensive consistency: where you currently check
DEVICE.type == "cuda" and torch.cuda.is_initialized() (in training.py), add
torch.cuda.is_available() to the condition so it mirrors the check used in
_trace_and_compile and guards against unavailable CUDA devices; modify the
condition that references DEVICE and torch.cuda.is_initialized() to include
torch.cuda.is_available().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: b0c09d7f-9c9f-4236-97ff-991995ee8557
📒 Files selected for processing (1)
deepmd/pt_expt/train/training.py
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
There are two issues hindering multitask DDP training in pt-expt compile mode:
Solutions:
Summary by CodeRabbit