Add Mooncake Backend for Rollout Data Transfer by zxpdemonio · Pull Request #1709 · THUDM/slime

zxpdemonio · 2026-03-11T08:12:21Z

Summary

This PR adds Mooncake DataProto rollout transfer as an optional transfer backend for slime. The default backend remains Ray; Mooncake is enabled explicitly for disaggregated rollout/training deployments that use Mooncake Store for cross-node data movement.

The implementation is intentionally low-intrusion:

- slime keeps its existing rollout dict layout and DP partitioning;
- only tensor-heavy rollout fields (tokens, loss_masks) are published as Mooncake remote tensor batches;
- non-tensor rollout fields and metadata stay in a lightweight transfer wrapper;
- actor/critic-side consumption materializes back into slime's legacy rollout data shape;
- Mooncake keys are cleaned up by the driver after the actor/critic train refs complete.
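The split described in the list above can be sketched in plain Python. This is a minimal illustration, not the code in this PR: the field names tokens and loss_masks come from the description, while split_rollout and restore_rollout are hypothetical helpers and plain lists stand in for tensors.

```python
# Hypothetical illustration: field names come from the PR text, the helpers do not.
TENSOR_FIELDS = ("tokens", "loss_masks")


def split_rollout(rollout_data):
    """Separate tensor-heavy fields from everything else."""
    tensor_part = {k: v for k, v in rollout_data.items() if k in TENSOR_FIELDS}
    local_part = {k: v for k, v in rollout_data.items() if k not in TENSOR_FIELDS}
    return tensor_part, local_part


def restore_rollout(tensor_part, local_part):
    """Rebuild the legacy rollout dict from the two halves."""
    restored = dict(local_part)
    restored.update(tensor_part)
    return restored


# Plain lists stand in for tensors so the sketch has no dependencies.
rollout = {
    "tokens": [[1, 2, 3], [4, 5]],
    "loss_masks": [[0, 1, 1], [1, 1]],
    "rewards": [0.5, 1.0],
    "response_lengths": [2, 2],
}
tensor_part, local_part = split_rollout(rollout)
restored = restore_rollout(tensor_part, local_part)
```

Only tensor_part would travel through the remote store in this picture; local_part rides along in the wrapper, and the consumer sees the same dict it saw before.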

Dependency

This PR depends on Mooncake remote tensor helper support from:

[Store] add remote tensor batch interfaces kvcache-ai/Mooncake#2050

The slime implementation uses Mooncake-side APIs such as RemoteTensorBatch, TensorFieldRef, selected tensor materialization, and the registered buffer pool. Merge or install a Mooncake build containing that PR before enabling --transfer-backend mooncake_dataproto in a real deployment.

Motivation

In multi-node RL training, rollout producers and actor consumers often exchange large rollout batches. Ray object transfer works as the default path, but cross-node transfer can spend significant time serializing and moving large tensor payloads.

Mooncake provides a Store/RDMA path and framework-neutral remote tensor helpers for tensor metadata, selected materialization, and reusable registered buffers. This PR wires slime to those helpers without requiring a slime-specific schema in Mooncake.

Implementation

CLI options

Option	Default	Meaning
`--transfer-backend`	`ray`	Set to `mooncake_dataproto` to enable Mooncake rollout transfer.
`--mooncake-dataproto-store-init-kwargs`	`null`	JSON kwargs used to initialize Mooncake Store. Use `{"setup_method":"setup"}` for real setup and `{"setup_method":"setup_dummy"}` for local tests.
`--mooncake-dataproto-hard-pin`	`true`	Hard-pin remote tensor data to the producer segment when publishing tensor batches.
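For readers who want to see how options of this shape are typically parsed, here is a minimal argparse sketch. The option names and defaults are taken from the table above; everything else is an assumption, including the str_to_bool helper and the choice to parse the hard-pin option as a valued boolean. slime's real definitions live in slime/utils/arguments.py and may differ.

```python
import argparse
import json


def str_to_bool(value):
    """Parse common true/false spellings (hypothetical helper)."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser()
    # Ray stays the default; Mooncake must be requested explicitly.
    parser.add_argument("--transfer-backend", default="ray",
                        choices=["ray", "mooncake_dataproto"])
    # The JSON string is decoded into a dict at parse time.
    parser.add_argument("--mooncake-dataproto-store-init-kwargs",
                        type=json.loads, default=None)
    # Assumption: the flag takes an explicit true/false value.
    parser.add_argument("--mooncake-dataproto-hard-pin",
                        type=str_to_bool, default=True)
    return parser


defaults = build_parser().parse_args([])
args = build_parser().parse_args([
    "--transfer-backend", "mooncake_dataproto",
    "--mooncake-dataproto-store-init-kwargs", '{"setup_method": "setup_dummy"}',
])
```

Decoding the JSON in the parser means a malformed kwargs string fails at startup rather than when the store is first initialized.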

Key code paths

slime/utils/remote_batch.py
- adds MooncakeRemoteBatch for tensor fields;
- wraps Mooncake RemoteTensorBatch / TensorFieldRef metadata;
- owns Mooncake store setup/cache helpers used by this transfer path;
- materializes via Mooncake registered buffer pool when available;
- removes published tensor keys during cleanup.
slime/utils/rollout_dataproto.py
- keeps a small transfer wrapper for local non-tensor data plus remote tensor metadata;
- splits rollout data by slime's existing DP partitions;
- stores only tokens and loss_masks through Mooncake remote tensor batches;
- converts materialized tensors back to slime's legacy rollout dict shape;
- tracks cleanup metadata and performs driver-side post-training cleanup.
Existing integration points
- slime/ray/rollout.py switches to split_rollout_data_by_dp_dataproto() only when --transfer-backend mooncake_dataproto is enabled;
- slime/utils/data.py materializes Mooncake transfer refs before legacy rollout processing;
- train.py and train_async.py clean up Mooncake keys after the actor/critic training refs have completed.
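The backend switch at the integration point can be pictured as a small dispatch. The sketch below is illustrative only: the function names, the contiguous partitioning policy, and the needs_materialize tag are all assumptions made for the example, not slime's actual code.

```python
def split_contiguous(samples, dp_size):
    """Split a list into dp_size contiguous chunks (illustrative policy only)."""
    chunk = (len(samples) + dp_size - 1) // dp_size
    return [samples[i * chunk:(i + 1) * chunk] for i in range(dp_size)]


def split_rollout_default(rollout_data, dp_size):
    # Stand-in for the unchanged default path: one dict per DP rank.
    return [
        {name: split_contiguous(values, dp_size)[rank]
         for name, values in rollout_data.items()}
        for rank in range(dp_size)
    ]


def split_rollout_mooncake(rollout_data, dp_size):
    # Stand-in for the Mooncake path: same partitions, tagged so the consumer
    # knows the shard must be materialized before legacy processing.
    return [{"needs_materialize": True, "shard": shard}
            for shard in split_rollout_default(rollout_data, dp_size)]


def split_rollout(rollout_data, dp_size, transfer_backend="ray"):
    # Only the explicit opt-in value selects the Mooncake path.
    if transfer_backend == "mooncake_dataproto":
        return split_rollout_mooncake(rollout_data, dp_size)
    return split_rollout_default(rollout_data, dp_size)


rollout = {"tokens": [[1], [2], [3], [4]], "rewards": [0.1, 0.2, 0.3, 0.4]}
ray_shards = split_rollout(rollout, dp_size=2)
mooncake_shards = split_rollout(rollout, dp_size=2,
                                transfer_backend="mooncake_dataproto")
```

Both branches reuse the same partitioning, which mirrors the PR's claim that DP partitioning is unchanged and only the transport differs.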

Data Flow

Default Ray path is unchanged.

Mooncake DataProto path:

rollout_data dict
    │
    ▼
slime DP partition selection
    │
    ▼
lightweight transfer wrapper
    │
    ├── tokens/loss_masks ─► MooncakeRemoteBatch / RemoteTensorBatch metadata
    ├── other fields ──────► local non_tensor_batch
    └── cleanup metadata ──► meta_info
    │
    ▼
actor/critic process_rollout_data()
    │
    ▼
dataproto_to_rollout_data()
    │
    ├── materialize remote tensor batch fields
    └── restore legacy rollout_data dict
    │
    ▼
actor/critic training
    │
    ▼
driver post-training cleanup removes Mooncake keys

Cleanup is intentionally driver-side after training completion rather than consumer-side after materialization, because the same rollout can be consumed by multiple workers and by both critic and actor.
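The lifecycle can be mimicked with an in-memory stand-in to show why cleanup waits for every consumer. InMemoryStore, publish, materialize, and cleanup below are hypothetical names for this sketch and are not the Mooncake or slime APIs; the key format is likewise invented.

```python
class InMemoryStore:
    """Dictionary-backed stand-in for a remote store; NOT the Mooncake API."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def remove(self, key):
        self._data.pop(key, None)

    def __len__(self):
        return len(self._data)


def publish(store, shard_id, rollout_data, tensor_fields=("tokens", "loss_masks")):
    """Store tensor fields remotely and keep the rest in a local wrapper."""
    keys = {}
    for name in tensor_fields:
        key = f"rollout/{shard_id}/{name}"  # invented key format
        store.put(key, rollout_data[name])
        keys[name] = key
    non_tensor = {k: v for k, v in rollout_data.items() if k not in tensor_fields}
    return {"remote_keys": keys, "non_tensor_batch": non_tensor}


def materialize(store, wrapper):
    """Rebuild the legacy rollout dict; does not delete anything."""
    data = dict(wrapper["non_tensor_batch"])
    for name, key in wrapper["remote_keys"].items():
        data[name] = store.get(key)
    return data


def cleanup(store, wrapper):
    """Driver-side removal, run once after all consumers are done."""
    for key in wrapper["remote_keys"].values():
        store.remove(key)


store = InMemoryStore()
rollout = {"tokens": [[1, 2, 3]], "loss_masks": [[0, 1, 1]], "rewards": [1.0]}
wrapper = publish(store, 0, rollout)
critic_view = materialize(store, wrapper)  # first consumer
actor_view = materialize(store, wrapper)   # second consumer reads the same keys
cleanup(store, wrapper)
```

If materialize removed keys itself, the second call would raise KeyError, which is the failure mode the driver-side cleanup avoids.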

Usage

python -m slime.train \
  --transfer-backend mooncake_dataproto \
  --mooncake-dataproto-store-init-kwargs '{"setup_method":"setup"}' \
  ...

For local unit tests or smoke tests:

python -m slime.train \
  --transfer-backend mooncake_dataproto \
  --mooncake-dataproto-store-init-kwargs '{"setup_method":"setup_dummy"}' \
  ...

Performance

Test method

Benchmark target: compare slime's Ray rollout transfer path with the new mooncake_dataproto path, using slime's real rollout data structure rather than an invented synthetic schema.

Environment:

- producer node: 192.168.22.70
- consumer node: 192.168.22.72
- Ray cluster: 192.168.22.70:6382
- Mooncake transport: RDMA
- Python env: /root/roll/.venv
- Mooncake transfer mode: low-intrusion remote tensor batch mode
- warmup: 1 round before measured run
- measured rounds shown here: 1 round per size
- DP size: 1
- sequence length: 2048
- generated fields match current slime rollout fields:
  - partitioned: tokens, multimodal_train_inputs, response_lengths, rewards, truncated, loss_masks, round_number, sample_indices, rollout_log_probs, rollout_routed_experts, prompt, teacher_log_probs
  - global: raw_reward, total_lengths

The benchmark includes end-to-end encode/decode/materialization costs in both paths:

Ray path: build slime DP shard, ray.put, ray.get, actor-side postprocess.
Mooncake path: build slime DP shard, tensor encode, remote put, transfer wrapper, materialization/decode, actor-side postprocess, cleanup.

For 128 MiB and 1 GiB, the actual payload size is measured as the pickle size of the generated data. For 16 GiB, the reported size is the target size rather than a measured pickle size, to avoid spending time pickling a huge object just for statistics; this does not change the generated data or the transfer path.
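Measuring a payload by its pickle size can be done along these lines. This is a generic sketch, not the benchmark script (which is not part of this PR); the pickle protocol and the pickled_size_mib helper are assumptions, and zero-filled bytes objects stand in for tensors.

```python
import pickle

MIB = 1024 * 1024


def pickled_size_mib(obj):
    """Return the size of the pickled object in MiB."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)) / MIB


# Zero-filled buffers stand in for tensor payloads.
payload = {"tokens": bytes(4 * MIB), "loss_masks": bytes(1 * MIB)}
size = pickled_size_mib(payload)
print(f"{size:.2f} MiB")
```

Pickling copies the whole object into memory once more, which is why skipping this step for the 16 GiB case is a reasonable shortcut.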

For 16 GiB Mooncake, the registered buffer pool/global segment were sized large enough to match online reusable-buffer behavior:

MOONCAKE_REGISTERED_BUFFER_POOL_BYTES=20GB
MOONCAKE_REGISTERED_BUFFER_POOL_MAX_BUFFER_BYTES=10GB
MOONCAKE_GLOBAL_SEGMENT_SIZE=24GB

End-to-end results

Target size	Actual size source	Backend	Put (ms)	Get/materialize (ms)	E2E (ms)	Speedup vs Ray
128 MiB	pickle, ~122.34 MiB	Ray	563.32	607.22	1170.55	1.00x
128 MiB	pickle, ~122.34 MiB	Mooncake DataProto	126.08	53.57	179.64	6.52x
1 GiB	pickle, ~978.76 MiB	Ray	4806.83	4797.96	9604.79	1.00x
1 GiB	pickle, ~978.73 MiB	Mooncake DataProto	1122.44	413.58	1536.02	6.25x
16 GiB	target	Ray	92934.14	81919.63	174853.78	1.00x
16 GiB	target	Mooncake DataProto	21476.33	7651.35	29127.68	6.00x
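The E2E and speedup columns follow from the Put and Get/materialize columns. The short check below recomputes them from the numbers in the table; the dictionary layout and the speedup helper are just for illustration.

```python
# (put_ms, get_or_materialize_ms) per backend, copied from the table above.
results = {
    "128 MiB": {"ray": (563.32, 607.22), "mooncake": (126.08, 53.57)},
    "1 GiB": {"ray": (4806.83, 4797.96), "mooncake": (1122.44, 413.58)},
    "16 GiB": {"ray": (92934.14, 81919.63), "mooncake": (21476.33, 7651.35)},
}


def speedup(row):
    """Ray end-to-end time divided by Mooncake end-to-end time."""
    ray_e2e = sum(row["ray"])
    mooncake_e2e = sum(row["mooncake"])
    return ray_e2e / mooncake_e2e


speedups = {size: round(speedup(row), 2) for size, row in results.items()}
for size, value in speedups.items():
    print(f"{size}: {value:.2f}x")
```

The recomputed ratios agree with the table to two decimals: 6.52x, 6.25x, and 6.00x.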

Fine-grained timing

128 MiB:

Backend	Breakdown
Ray	shard build 0.81 ms, ray put 565.51 ms, ray get 598.20 ms, postprocess 0.07 ms
Mooncake	shard build 1.33 ms, tensor encode 114.13 ms, remote put 21.56 ms, proto wrap 6.35 ms, materialize 67.50 ms, postprocess 0.12 ms, cleanup 0.24 ms

1 GiB:

Backend	Breakdown
Ray	shard build 8.18 ms, ray put 4749.82 ms, ray get 4539.82 ms, postprocess 0.47 ms
Mooncake	shard build 5.67 ms, tensor encode 930.23 ms, remote put 131.55 ms, proto wrap 23.64 ms, materialize 402.63 ms, postprocess 0.49 ms, cleanup 0.24 ms

Observation: in the current slime integration, Mooncake's remote put and materialize steps take far less time than Ray's object put/get (at 1 GiB: remote put 131.55 ms and materialize 402.63 ms, versus ray put 4749.82 ms and ray get 4539.82 ms). Mooncake's put-side time is dominated by Python tensor encoding (930.23 ms at 1 GiB). This low-intrusion phase keeps actor consumption unchanged; exploiting partial consumption and range reads is left to a follow-up phase.
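How strongly tensor encoding dominates can be read off the breakdown rows. The sketch below assumes that "put side" means shard build, tensor encode, remote put, and proto wrap; that grouping is an interpretation of the breakdown, not something the PR defines.

```python
# Mooncake put-side stages in ms, copied from the fine-grained breakdown.
mooncake_put_side_ms = {
    "128 MiB": {"shard build": 1.33, "tensor encode": 114.13,
                "remote put": 21.56, "proto wrap": 6.35},
    "1 GiB": {"shard build": 5.67, "tensor encode": 930.23,
              "remote put": 131.55, "proto wrap": 23.64},
}


def encode_share(stages):
    """Fraction of the put-side time spent in tensor encoding."""
    return stages["tensor encode"] / sum(stages.values())


shares = {size: round(encode_share(stages), 2)
          for size, stages in mooncake_put_side_ms.items()}
for size, share in shares.items():
    print(f"{size}: tensor encode is {share:.0%} of put-side time")
```

Under that grouping, encoding accounts for roughly 80% of the put side at 128 MiB and 85% at 1 GiB, so the encode step is the natural place to look for further gains.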

Documentation

Added English usage documentation:

docs/en/advanced/mooncake-dataproto-transfer.md
linked from docs/en/index.rst

The doc describes what the feature does and how to use it; detailed design notes and benchmark helper scripts are intentionally not included in this PR.

Testing

/snap/bin/ruff check docs/en/advanced/mooncake-dataproto-transfer.md slime/utils/arguments.py slime/utils/data.py slime/utils/remote_batch.py slime/utils/rollout_dataproto.py tests/utils/test_dataproto_transfer.py slime/ray/rollout.py train.py train_async.py
git diff HEAD --check
PYTHONPATH=/root/slime:/root/Mooncake-ROLL/mooncake-wheel /root/roll/.venv/bin/python -m pytest -q tests/utils/test_dataproto_transfer.py
- 9 passed

No benchmark scripts or benchmark outputs are included in this PR.

Checklist

Mooncake DataProto transfer backend wiring
Low-intrusion tensor-batch remote transfer mode
Driver-side post-training cleanup lifecycle
Basic functional tests
Format/lint checks
English usage documentation

🤖 Generated with Claude Code

zxpdemonio force-pushed the mooncake branch from 8b49deb to 8b46fd5 on March 11, 2026 08:49

stmatengss mentioned this pull request Mar 12, 2026

[RoadMap][Call For Contributions] Mooncake Store V3 Roadmap kvcache-ai/Mooncake#1035

Open

47 tasks

lilei199908 added the run-ci-megatron label Mar 12, 2026

zxpdemonio force-pushed the mooncake branch 2 times, most recently from ab886da to a81a18d on May 11, 2026 15:41

zxpdemonio and others added 2 commits June 25, 2026 14:50

feat: add Mooncake DataProto rollout transfer

4ee0078

Add an optional mooncake_dataproto transfer backend that publishes rollout tensor fields through Mooncake while preserving slime's existing rollout data layout and Ray default path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Align Mooncake rollout transfer with structured DataProto API

5e4072f

Route remote rollout batches through MooncakeBundleTransfer put/get/cleanup DataProto helpers so slime matches the refactored PR2050 interface. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zxpdemonio closed this Jun 25, 2026

zxpdemonio force-pushed the mooncake branch from a81a18d to bf9b1a3 on June 25, 2026 09:04

zxpdemonio reopened this Jun 25, 2026

zxpdemonio and others added 2 commits June 25, 2026 19:39

Align Mooncake rollout transfer with DataProto handles

a86be28

Use Mooncake structured DataProto handles directly for rollout dict transport so slime no longer carries a local DataProto/RemoteBatch wrapper. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Rename Mooncake rollout transfer backend

175dad6

Expose the rollout transfer backend as mooncake while keeping mooncake_dataproto as a compatibility alias for existing scripts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zxpdemonio wants to merge 4 commits into THUDM:main from zxpdemonio:mooncake
