Add Mooncake Backend for Rollout Data Transfer #1709
Open
zxpdemonio wants to merge 4 commits into
Add an optional mooncake_dataproto transfer backend that publishes rollout tensor fields through Mooncake while preserving slime's existing rollout data layout and Ray default path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Route remote rollout batches through MooncakeBundleTransfer put/get/cleanup DataProto helpers so slime matches the refactored PR2050 interface. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use Mooncake structured DataProto handles directly for rollout dict transport so slime no longer carries a local DataProto/RemoteBatch wrapper. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Expose the rollout transfer backend as mooncake while keeping mooncake_dataproto as a compatibility alias for existing scripts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
This PR adds Mooncake DataProto rollout transfer as an optional transfer backend for slime. The default backend remains Ray; Mooncake is enabled explicitly for disaggregated rollout/training deployments that use Mooncake Store for cross-node data movement.
The implementation is intentionally low-intrusion:
- tensor fields (tokens, loss_masks) are published as Mooncake remote tensor batches;
Dependency
This PR depends on Mooncake remote tensor helper support from:
The slime implementation uses Mooncake-side APIs such as RemoteTensorBatch, TensorFieldRef, selected tensor materialization, and the registered buffer pool. Merge or install a Mooncake build containing that PR before enabling --transfer-backend mooncake_dataproto in a real deployment.
Motivation
In multi-node RL training, rollout producers and actor consumers often exchange large rollout batches. Ray object transfer works as the default path, but cross-node transfer can spend significant time serializing and moving large tensor payloads.
Mooncake provides a Store/RDMA path and framework-neutral remote tensor helpers for tensor metadata, selected materialization, and reusable registered buffers. This PR wires slime to those helpers without requiring a slime-specific schema in Mooncake.
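To make "tensor metadata" and "selected materialization" concrete, here is a minimal sketch of the idea using only the standard library. It is not Mooncake's API: FieldRef, InMemoryStore, publish and materialize are hypothetical names, and a plain dict stands in for the remote store. The producer stores each field's bytes under its own key and hands out small metadata handles; the consumer fetches only the fields it asks for.

```python
from array import array
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRef:
    """Hypothetical metadata handle: where a field lives and how to decode it."""
    key: str        # store key holding the raw bytes
    typecode: str   # array typecode, e.g. "q" for signed 64-bit integers
    length: int     # number of elements


class InMemoryStore:
    """Stand-in for a remote store (assumption: not the Mooncake Store API)."""

    def __init__(self):
        self._blobs = {}
        self.reads = 0  # counts payload fetches, to show what was skipped

    def put(self, key, payload):
        self._blobs[key] = payload

    def get(self, key):
        self.reads += 1
        return self._blobs[key]


def publish(store, batch_id, fields):
    """Store each field's bytes and return lightweight metadata only."""
    refs = {}
    for name, values in fields.items():
        key = f"{batch_id}/{name}"
        store.put(key, values.tobytes())
        refs[name] = FieldRef(key=key, typecode=values.typecode, length=len(values))
    return refs


def materialize(store, refs, selected):
    """Fetch and decode only the requested fields."""
    out = {}
    for name in selected:
        ref = refs[name]
        values = array(ref.typecode)
        values.frombytes(store.get(ref.key))
        assert len(values) == ref.length
        out[name] = values
    return out


store = InMemoryStore()
refs = publish(store, "rollout-0", {
    "tokens": array("q", [101, 2023, 102]),
    "loss_masks": array("b", [0, 1, 1]),
})
only_tokens = materialize(store, refs, ["tokens"])
```

Only the metadata handles travel with the batch; the loss_masks payload is never fetched in this run because the consumer did not select it.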
Implementation
CLI options
| Option | Default | Description |
| --- | --- | --- |
| --transfer-backend | ray | Set to mooncake_dataproto to enable Mooncake rollout transfer. |
| --mooncake-dataproto-store-init-kwargs | null | Use {"setup_method":"setup"} for real setup and {"setup_method":"setup_dummy"} for local tests. |
| --mooncake-dataproto-hard-pin | true | |

Key code paths
- slime/utils/remote_batch.py
  - MooncakeRemoteBatch for tensor fields;
  - RemoteTensorBatch/TensorFieldRef metadata;
- slime/utils/rollout_dataproto.py
  - tokens and loss_masks through Mooncake remote tensor batches;
Existing integration points
- slime/ray/rollout.py switches to split_rollout_data_by_dp_dataproto() only when --transfer-backend mooncake_dataproto is enabled;
- slime/utils/data.py materializes Mooncake transfer refs before legacy rollout processing;
- train.py and train_async.py clean up Mooncake keys after the actor/critic training refs have completed.
Data Flow
Default Ray path is unchanged.
Mooncake DataProto path:
Cleanup is intentionally driver-side after training completion rather than consumer-side after materialization, because the same rollout can be consumed by multiple workers and by both critic and actor.
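The ordering argument above can be sketched as follows. This is an illustration under stated assumptions, not the code in train.py: InMemoryStore and consume are hypothetical, threads stand in for the critic and actor workers, and a dict stands in for the Mooncake keys. The driver waits for every consumer before removing anything.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class InMemoryStore:
    """Stand-in for the remote store (assumption: not the Mooncake Store API)."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, value):
        self._blobs[key] = value

    def get(self, key):
        return self._blobs[key]

    def remove(self, key):
        self._blobs.pop(key, None)

    def __len__(self):
        return len(self._blobs)


def consume(store, keys, role):
    """Hypothetical training step: every consumer reads the same rollout keys."""
    total = sum(sum(store.get(key)) for key in keys)
    return role, total


store = InMemoryStore()
keys = ["rollout-0/tokens", "rollout-0/loss_masks"]
store.put(keys[0], [101, 2023, 102])
store.put(keys[1], [0, 1, 1])

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(consume, store, keys, role) for role in ("critic", "actor")]
    wait(futures)  # the driver blocks until every consumer has finished
    results = dict(future.result() for future in futures)

# Only now is deletion safe: no consumer can still need the keys.
for key in keys:
    store.remove(key)
```

If each consumer removed the keys right after its own read, whichever of critic or actor ran second could find them already gone, which is the failure that driver-side cleanup avoids.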
Usage
    python -m slime.train \
      --transfer-backend mooncake_dataproto \
      --mooncake-dataproto-store-init-kwargs '{"setup_method":"setup"}' \
      ...
For local unit tests or smoke tests:
    python -m slime.train \
      --transfer-backend mooncake_dataproto \
      --mooncake-dataproto-store-init-kwargs '{"setup_method":"setup_dummy"}' \
      ...
Performance
Test method
Benchmark target: compare slime's Ray rollout transfer path with the new mooncake_dataproto path, using slime's real rollout data structure rather than an invented synthetic schema.
Environment:
- 192.168.22.70
- 192.168.22.72
- 192.168.22.70:6382
- /root/roll/.venv
- tokens, multimodal_train_inputs, response_lengths, rewards, truncated, loss_masks, round_number, sample_indices, rollout_log_probs, rollout_routed_experts, prompt, teacher_log_probs
- raw_reward, total_lengths
The benchmark includes end-to-end encode/decode/materialization costs in both paths:
- ray.put, ray.get, actor-side postprocess.
For 128 MiB and 1 GiB, the actual payload size is measured by pickle size. For 16 GiB, the actual size is reported from the target size to avoid spending time pickling the huge object just for statistics; this does not change the generated data or transfer path.
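As an illustration of measuring payload size by pickle size, a minimal sketch follows. The helper name pickled_size_bytes and the miniature rollout dict are hypothetical; the benchmark helper scripts are not part of this PR, so this only shows the general technique.

```python
import pickle


def pickled_size_bytes(obj):
    """Return the length in bytes of obj's pickle serialization."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


# Hypothetical miniature rollout batch, using field names from the list above.
rollout = {
    "tokens": [[101, 2023, 102], [101, 2003, 102]],
    "loss_masks": [[0, 1, 1], [0, 1, 1]],
    "rewards": [1.0, 0.0],
}

size = pickled_size_bytes(rollout)
size_mib = size / (1024 ** 2)  # convert bytes to MiB for reporting
```

Serializing the object just to count bytes costs time and memory proportional to the payload, which is why skipping this step for a 16 GiB batch is a reasonable shortcut.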
For 16 GiB Mooncake, the registered buffer pool/global segment were sized large enough to match online reusable-buffer behavior:
- MOONCAKE_REGISTERED_BUFFER_POOL_BYTES=20GB
- MOONCAKE_REGISTERED_BUFFER_POOL_MAX_BUFFER_BYTES=10GB
- MOONCAKE_GLOBAL_SEGMENT_SIZE=24GB
End-to-end results
Fine-grained timing
128 MiB:
1 GiB:
Observation: in the current slime integration, Mooncake's online remote put/read time is much lower than Ray's object put/get time, while Mooncake put-side time is dominated by Python tensor encoding. This low-intrusion phase keeps actor consumption unchanged; exploiting partial consumption and range reads is left to a follow-up phase.
Documentation
Added English usage documentation:
- docs/en/advanced/mooncake-dataproto-transfer.md
- docs/en/index.rst
The doc describes what the feature does and how to use it; detailed design notes and benchmark helper scripts are intentionally not included in this PR.
Testing
- /snap/bin/ruff check docs/en/advanced/mooncake-dataproto-transfer.md slime/utils/arguments.py slime/utils/data.py slime/utils/remote_batch.py slime/utils/rollout_dataproto.py tests/utils/test_dataproto_transfer.py slime/ray/rollout.py train.py train_async.py
- git diff HEAD --check
- PYTHONPATH=/root/slime:/root/Mooncake-ROLL/mooncake-wheel /root/roll/.venv/bin/python -m pytest -q tests/utils/test_dataproto_transfer.py
  - 9 passed
No benchmark scripts or benchmark outputs are included in this PR.
Checklist
🤖 Generated with Claude Code