Add Mooncake Backend for Rollout Data Transfer#1709

Open
zxpdemonio wants to merge 4 commits into
THUDM:mainfrom
zxpdemonio:mooncake

@zxpdemonio zxpdemonio commented Mar 11, 2026


Summary

This PR adds Mooncake DataProto rollout transfer as an optional transfer backend for slime. The default backend remains Ray; Mooncake is enabled explicitly for disaggregated rollout/training deployments that use Mooncake Store for cross-node data movement.

The implementation is intentionally low-intrusion:

  • slime keeps its existing rollout dict layout and DP partitioning;
  • only tensor-heavy rollout fields (tokens, loss_masks) are published as Mooncake remote tensor batches;
  • non-tensor rollout fields and metadata stay in a lightweight transfer wrapper;
  • actor/critic-side consumption materializes back into slime's legacy rollout data shape;
  • Mooncake keys are cleaned up by the driver after the actor/critic train refs complete.

Dependency

This PR depends on Mooncake remote tensor helper support from:

The slime implementation uses Mooncake-side APIs such as RemoteTensorBatch, TensorFieldRef, selected tensor materialization, and the registered buffer pool. Merge or install a Mooncake build containing that PR before enabling --transfer-backend mooncake_dataproto in a real deployment.

Motivation

In multi-node RL training, rollout producers and actor consumers often exchange large rollout batches. Ray object transfer works as the default path, but cross-node transfer can spend significant time serializing and moving large tensor payloads.

Mooncake provides a Store/RDMA path and framework-neutral remote tensor helpers for tensor metadata, selected materialization, and reusable registered buffers. This PR wires slime to those helpers without requiring a slime-specific schema in Mooncake.

Implementation

CLI options

| Option | Default | Meaning |
|---|---|---|
| --transfer-backend | ray | Set to mooncake_dataproto to enable Mooncake rollout transfer. |
| --mooncake-dataproto-store-init-kwargs | null | JSON kwargs used to initialize Mooncake Store. Use {"setup_method":"setup"} for real setup and {"setup_method":"setup_dummy"} for local tests. |
| --mooncake-dataproto-hard-pin | true | Hard-pin remote tensor data to the producer segment when publishing tensor batches. |
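
As a rough illustration of how these three options could be parsed, here is a minimal argparse sketch. The option names and defaults come from the table; the parser function, the `_str2bool` helper, and the choice to parse the kwargs with `json.loads` are assumptions made for this example and may not match slime/utils/arguments.py.

```python
import argparse
import json


def _str2bool(value: str) -> bool:
    # Hypothetical helper: accept common spellings of true/false.
    return value.strip().lower() in ("1", "true", "yes")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Ray stays the default; Mooncake must be requested explicitly.
    parser.add_argument(
        "--transfer-backend",
        default="ray",
        choices=["ray", "mooncake_dataproto"],
    )
    # JSON string decoded into a dict, e.g. '{"setup_method":"setup_dummy"}'.
    parser.add_argument(
        "--mooncake-dataproto-store-init-kwargs",
        type=json.loads,
        default=None,
    )
    parser.add_argument(
        "--mooncake-dataproto-hard-pin",
        type=_str2bool,
        default=True,
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        [
            "--transfer-backend", "mooncake_dataproto",
            "--mooncake-dataproto-store-init-kwargs", '{"setup_method":"setup_dummy"}',
        ]
    )
    print(args.transfer_backend, args.mooncake_dataproto_store_init_kwargs)
```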

Key code paths

  1. slime/utils/remote_batch.py

    • adds MooncakeRemoteBatch for tensor fields;
    • wraps Mooncake RemoteTensorBatch / TensorFieldRef metadata;
    • owns Mooncake store setup/cache helpers used by this transfer path;
    • materializes via Mooncake registered buffer pool when available;
    • removes published tensor keys during cleanup.
  2. slime/utils/rollout_dataproto.py

    • keeps a small transfer wrapper for local non-tensor data plus remote tensor metadata;
    • splits rollout data by slime's existing DP partitions;
    • stores only tokens and loss_masks through Mooncake remote tensor batches;
    • converts materialized tensors back to slime's legacy rollout dict shape;
    • tracks cleanup metadata and performs driver-side post-training cleanup.
  3. Existing integration points

    • slime/ray/rollout.py switches to split_rollout_data_by_dp_dataproto() only when --transfer-backend mooncake_dataproto is enabled;
    • slime/utils/data.py materializes Mooncake transfer refs before legacy rollout processing;
    • train.py and train_async.py clean up Mooncake keys after the actor/critic training refs have completed.

Data Flow

The default Ray path is unchanged.

Mooncake DataProto path:

rollout_data dict
    │
    ▼
slime DP partition selection
    │
    ▼
lightweight transfer wrapper
    │
    ├── tokens/loss_masks ─► MooncakeRemoteBatch / RemoteTensorBatch metadata
    ├── other fields ──────► local non_tensor_batch
    └── cleanup metadata ──► meta_info
    │
    ▼
actor/critic process_rollout_data()
    │
    ▼
dataproto_to_rollout_data()
    │
    ├── materialize remote tensor batch fields
    └── restore legacy rollout_data dict
    │
    ▼
actor/critic training
    │
    ▼
driver post-training cleanup removes Mooncake keys

Cleanup is intentionally driver-side after training completion rather than consumer-side after materialization, because the same rollout can be consumed by multiple workers and by both critic and actor.
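
The flow above, and the reason cleanup waits for every consumer, can be sketched with a plain dict standing in for Mooncake Store. Everything here is hypothetical scaffolding for illustration: `FAKE_STORE`, `TransferWrapper`, `publish`, `materialize`, and `cleanup` are not the Mooncake or slime APIs, and plain lists stand in for tensors.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any

# Only these fields go through the remote store, as described in the PR.
TENSOR_FIELDS = ("tokens", "loss_masks")

# Stand-in for Mooncake Store: an in-process dict, NOT the real API.
FAKE_STORE: dict[str, Any] = {}


@dataclass
class TransferWrapper:
    remote_keys: dict[str, str]            # field name -> store key
    non_tensor_batch: dict[str, Any]       # fields that travel locally
    meta_info: dict[str, Any] = field(default_factory=dict)


def publish(rollout_data: dict[str, Any]) -> TransferWrapper:
    """Put tensor-heavy fields in the store and keep the rest local."""
    remote_keys = {}
    for name in TENSOR_FIELDS:
        if name in rollout_data:
            key = f"rollout/{uuid.uuid4().hex}/{name}"
            FAKE_STORE[key] = rollout_data[name]
            remote_keys[name] = key
    local = {k: v for k, v in rollout_data.items() if k not in TENSOR_FIELDS}
    meta = {"cleanup_keys": list(remote_keys.values())}
    return TransferWrapper(remote_keys, local, meta)


def materialize(wrapper: TransferWrapper) -> dict[str, Any]:
    """Rebuild the legacy rollout dict on the consumer side."""
    data = dict(wrapper.non_tensor_batch)
    for name, key in wrapper.remote_keys.items():
        data[name] = FAKE_STORE[key]
    return data


def cleanup(wrapper: TransferWrapper) -> None:
    """Driver-side removal, called only after all consumers finish."""
    for key in wrapper.meta_info["cleanup_keys"]:
        FAKE_STORE.pop(key, None)


if __name__ == "__main__":
    rollout = {"tokens": [[1, 2, 3]], "loss_masks": [[0, 1, 1]], "rewards": [0.5]}
    wrapper = publish(rollout)
    critic_view = materialize(wrapper)  # first consumer
    actor_view = materialize(wrapper)   # second consumer reads the same keys
    cleanup(wrapper)                    # safe only now
    print(critic_view == rollout, actor_view == rollout, len(FAKE_STORE))
```

If the first consumer removed the keys right after materializing, the second `materialize` call would fail with a missing key, which is the situation the driver-side cleanup avoids.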

Usage

python -m slime.train \
  --transfer-backend mooncake_dataproto \
  --mooncake-dataproto-store-init-kwargs '{"setup_method":"setup"}' \
  ...

For local unit tests or smoke tests:

python -m slime.train \
  --transfer-backend mooncake_dataproto \
  --mooncake-dataproto-store-init-kwargs '{"setup_method":"setup_dummy"}' \
  ...

Performance

Test method

Benchmark target: compare slime's Ray rollout transfer path with the new mooncake_dataproto path, using slime's real rollout data structure rather than an invented synthetic schema.

Environment:

  • producer node: 192.168.22.70
  • consumer node: 192.168.22.72
  • Ray cluster: 192.168.22.70:6382
  • Mooncake transport: RDMA
  • Python env: /root/roll/.venv
  • Mooncake transfer mode: low-intrusion remote tensor batch mode
  • warmup: 1 round before measured run
  • measured rounds shown here: 1 round per size
  • DP size: 1
  • sequence length: 2048
  • generated fields match current slime rollout fields:
    • partitioned: tokens, multimodal_train_inputs, response_lengths, rewards, truncated, loss_masks, round_number, sample_indices, rollout_log_probs, rollout_routed_experts, prompt, teacher_log_probs
    • global: raw_reward, total_lengths

The benchmark includes end-to-end encode/decode/materialization costs in both paths:

  • Ray path: build slime DP shard, ray.put, ray.get, actor-side postprocess.
  • Mooncake path: build slime DP shard, tensor encode, remote put, transfer wrapper, materialization/decode, actor-side postprocess, cleanup.

For 128 MiB and 1 GiB, the actual payload size is measured as the pickled size. For 16 GiB, the target size is reported instead, to avoid spending time pickling such a large object purely for statistics; this does not change the generated data or the transfer path.
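
Measuring payload size by pickling can be done in a few lines. This is a minimal sketch: the helper name and the choice of `pickle.HIGHEST_PROTOCOL` are assumptions, since the PR does not say which pickle protocol the benchmark used, and a `bytes` object stands in for real rollout tensors.

```python
import pickle

MIB = 1024 * 1024


def pickled_size_mib(obj) -> float:
    """Return the size of the pickled object in MiB."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)) / MIB


if __name__ == "__main__":
    # A 4 MiB byte string as a stand-in payload.
    payload = {"tokens": bytes(4 * MIB)}
    print(f"{pickled_size_mib(payload):.2f} MiB")
```

Pickling copies the whole payload once, which is why doing this for a 16 GiB object only to report a number is not worth the time.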

For the 16 GiB Mooncake run, the registered buffer pool and global segment were sized large enough to match online reusable-buffer behavior:

  • MOONCAKE_REGISTERED_BUFFER_POOL_BYTES=20GB
  • MOONCAKE_REGISTERED_BUFFER_POOL_MAX_BUFFER_BYTES=10GB
  • MOONCAKE_GLOBAL_SEGMENT_SIZE=24GB

End-to-end results

| Target size | Actual size (source) | Backend | Put (ms) | Get/materialize (ms) | E2E (ms) | Speedup vs Ray |
|---|---|---|---|---|---|---|
| 128 MiB | ~122.34 MiB (pickle) | Ray | 563.32 | 607.22 | 1170.55 | 1.00x |
| 128 MiB | ~122.34 MiB (pickle) | Mooncake DataProto | 126.08 | 53.57 | 179.64 | 6.52x |
| 1 GiB | ~978.76 MiB (pickle) | Ray | 4806.83 | 4797.96 | 9604.79 | 1.00x |
| 1 GiB | ~978.73 MiB (pickle) | Mooncake DataProto | 1122.44 | 413.58 | 1536.02 | 6.25x |
| 16 GiB | target size | Ray | 92934.14 | 81919.63 | 174853.78 | 1.00x |
| 16 GiB | target size | Mooncake DataProto | 21476.33 | 7651.35 | 29127.68 | 6.00x |
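
The speedup column is the Ray end-to-end time divided by the Mooncake end-to-end time for the same target size. The short check below recomputes it from the E2E values in the table; the dictionary layout and function name are only for illustration.

```python
# End-to-end milliseconds copied from the results table.
E2E_MS = {
    "128 MiB": {"ray": 1170.55, "mooncake": 179.64},
    "1 GiB": {"ray": 9604.79, "mooncake": 1536.02},
    "16 GiB": {"ray": 174853.78, "mooncake": 29127.68},
}


def speedup_vs_ray(size: str) -> float:
    """Ray E2E time divided by Mooncake E2E time for one target size."""
    row = E2E_MS[size]
    return row["ray"] / row["mooncake"]


if __name__ == "__main__":
    for size in E2E_MS:
        print(f"{size}: {speedup_vs_ray(size):.2f}x")
```

Rounded to two decimals this reproduces the 6.52x, 6.25x, and 6.00x figures reported in the table.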

Fine-grained timing

128 MiB:

| Backend | Breakdown |
|---|---|
| Ray | shard build 0.81 ms, ray put 565.51 ms, ray get 598.20 ms, postprocess 0.07 ms |
| Mooncake | shard build 1.33 ms, tensor encode 114.13 ms, remote put 21.56 ms, proto wrap 6.35 ms, materialize 67.50 ms, postprocess 0.12 ms, cleanup 0.24 ms |

1 GiB:

| Backend | Breakdown |
|---|---|
| Ray | shard build 8.18 ms, ray put 4749.82 ms, ray get 4539.82 ms, postprocess 0.47 ms |
| Mooncake | shard build 5.67 ms, tensor encode 930.23 ms, remote put 131.55 ms, proto wrap 23.64 ms, materialize 402.63 ms, postprocess 0.49 ms, cleanup 0.24 ms |

Observation: in the current slime integration, Mooncake's online remote put and read stages take far less time than Ray's object put/get, and Mooncake's put-side time is dominated by Python tensor encoding rather than by the transfer itself. This low-intrusion phase keeps actor consumption unchanged; exploiting partial consumption and range reads is left to a follow-up phase.
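
To put a number on the encoding share, the sketch below divides the tensor encode time by the sum of the put-side stages from the fine-grained timings. Treating tensor encode, remote put, and proto wrap as the put side is an assumption made for this example; the PR does not define that grouping explicitly.

```python
# Mooncake put-side stage timings in milliseconds, from the breakdowns.
PUT_SIDE_MS = {
    "128 MiB": {"tensor encode": 114.13, "remote put": 21.56, "proto wrap": 6.35},
    "1 GiB": {"tensor encode": 930.23, "remote put": 131.55, "proto wrap": 23.64},
}


def encode_share(size: str) -> float:
    """Fraction of the put-side time spent in Python tensor encoding."""
    stages = PUT_SIDE_MS[size]
    return stages["tensor encode"] / sum(stages.values())


if __name__ == "__main__":
    for size in PUT_SIDE_MS:
        print(f"{size}: {encode_share(size):.1%} of put-side time is tensor encode")
```

Under that grouping, encoding accounts for roughly 80% of the put-side time at 128 MiB and roughly 86% at 1 GiB, while the remote put itself stays well below a fifth.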

Documentation

Added English usage documentation:

  • docs/en/advanced/mooncake-dataproto-transfer.md
  • linked from docs/en/index.rst

The doc describes what the feature does and how to use it; detailed design notes and benchmark helper scripts are intentionally not included in this PR.

Testing

  • /snap/bin/ruff check docs/en/advanced/mooncake-dataproto-transfer.md slime/utils/arguments.py slime/utils/data.py slime/utils/remote_batch.py slime/utils/rollout_dataproto.py tests/utils/test_dataproto_transfer.py slime/ray/rollout.py train.py train_async.py
  • git diff HEAD --check
  • PYTHONPATH=/root/slime:/root/Mooncake-ROLL/mooncake-wheel /root/roll/.venv/bin/python -m pytest -q tests/utils/test_dataproto_transfer.py
    • 9 passed

No benchmark scripts or benchmark outputs are included in this PR.

Checklist

  • Mooncake DataProto transfer backend wiring
  • Low-intrusion tensor-batch remote transfer mode
  • Driver-side post-training cleanup lifecycle
  • Basic functional tests
  • Format/lint checks
  • English usage documentation

🤖 Generated with Claude Code

zxpdemonio and others added 2 commits June 25, 2026 14:50
Add an optional mooncake_dataproto transfer backend that publishes rollout tensor fields through Mooncake while preserving slime's existing rollout data layout and Ray default path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Route remote rollout batches through MooncakeBundleTransfer put/get/cleanup DataProto helpers so slime matches the refactored PR2050 interface.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
zxpdemonio and others added 2 commits June 25, 2026 19:39
Use Mooncake structured DataProto handles directly for rollout dict transport so slime no longer carries a local DataProto/RemoteBatch wrapper.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Expose the rollout transfer backend as mooncake while keeping mooncake_dataproto as a compatibility alias for existing scripts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>