fix: bypass gloo DDP for Windows single-GPU training#2744

Open
fanfan-love-meatmeat wants to merge 2 commits into RVC-Boss:main from fanfan-love-meatmeat:fix/windows-singlegpu-gloo

Conversation

@fanfan-love-meatmeat

Problem

On Windows with a single GPU, dist.init_process_group() using the gloo backend fails with:
RuntimeError: unsupported gloo device

This is caused by virtual network adapters (VPN, VMware, Hyper-V, etc.) interfering with gloo's network interface detection.

Solution

Skip DDP initialization entirely for Windows single-GPU setups, rather than patching gloo environment variables.
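The guard can be sketched roughly as follows (a minimal illustration, not the exact patch; the function name and the way `n_gpus` is detected are assumptions):

```python
import platform

def should_init_ddp(system: str, n_gpus: int) -> bool:
    """Return True when distributed init is worthwhile.

    On Windows with a single GPU, gloo's network interface detection can
    pick a virtual adapter (VPN, VMware, Hyper-V, etc.) and raise
    'RuntimeError: unsupported gloo device', so DDP is skipped entirely.
    """
    return not (system == "Windows" and n_gpus <= 1)

# Illustrative use at the top of the training script:
# if should_init_ddp(platform.system(), torch.cuda.device_count()):
#     dist.init_process_group(backend="gloo", ...)
```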

Changes

  • s2_train.py: Skip dist.init_process_group() on Windows single-GPU; add DummyDDP wrapper to maintain .module interface compatibility
  • s1_train.py: Set USE_LIBUV=0 to avoid socket conflicts; use strategy='auto' for single-GPU (bypasses gloo entirely), with DDPStrategy only activated for multi-GPU setups
  • utils.py / bucket_sampler.py: Related compatibility fixes
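The DummyDDP wrapper mentioned above can be sketched like this (a hedged illustration, not the exact code in the PR): it exposes the wrapped model as `.module` and forwards calls and attribute lookups, so training code written against DistributedDataParallel's interface runs unchanged when DDP initialization is skipped.

```python
class DummyDDP:
    """Minimal stand-in for torch.nn.parallel.DistributedDataParallel.

    Exposes the wrapped model as `.module` and delegates everything else,
    so code that unwraps `model.module` or calls the model directly keeps
    working when no process group was initialized.
    """

    def __init__(self, module):
        self.module = module

    def __call__(self, *args, **kwargs):
        # Forward calls to the wrapped model, as DDP does.
        return self.module(*args, **kwargs)

    def __getattr__(self, name):
        # Fall through to the wrapped model for anything DummyDDP itself
        # does not define (e.g. .train(), .eval(), .parameters()).
        return getattr(self.module, name)
```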

Tested on

  • Windows 11, single NVIDIA RTX 5060 GPU, Python 3.10, PyTorch 2.5, CUDA 12.4
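For the Lightning-based s1_train.py, the described change amounts to roughly the following (a sketch under the assumption that `n_gpus` is detected elsewhere; the function name is illustrative):

```python
import os

# Avoid socket conflicts on Windows: newer PyTorch builds default the
# TCPStore rendezvous to libuv, which the PR disables.
os.environ["USE_LIBUV"] = "0"

def pick_strategy(n_gpus: int) -> str:
    """Return a Lightning Trainer `strategy` argument.

    Single GPU: 'auto' lets Lightning run without a process group,
    bypassing gloo entirely. Multi-GPU: fall back to DDP (in the actual
    script this would be a DDPStrategy instance rather than a string).
    """
    if n_gpus <= 1:
        return "auto"
    return "ddp"
```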

…train_v3_lora

Extend the fix to v3 and LoRA training scripts:
- s2_train_v3.py: skip dist.init_process_group() + DummyDDP for Windows single-GPU
- s2_train_v3_lora.py: same fix applied to LoRA fine-tuning script